Linear-Gompertz Model-Based Regression of Photovoltaic Power Generation by Satellite Imagery-Based Solar Irradiance

A simple yet accurate photovoltaic (PV) performance curve as a function of satellite-based solar irradiation is necessary to develop a PV power forecasting model that can cover all of South Korea, where more than 35,000 PV power plants are currently in operation. In order to express the nonlinear power output of the PV module with respect to the hourly global horizontal irradiance derived from satellite images, this study employed the Gompertz model, a sigmoid equation with three parameters. The nonphysical behavior of the Gompertz model within the low solar irradiation range was corrected by combining a linear equation with the same gradient at the conjoint point. The overall fitness of the Linear-Gompertz regression to the 242 PV power plants representing the country was $R^2$ = 0.85 and nRMSE = 0.09. The Gompertz model coefficients showed normal distributions and equal variances, with standard deviations of less than 15%, by year and by season. Therefore, it can be conjectured that the Linear-Gompertz model represents the whole country's PV system performance curve. In addition, the Gompertz coefficient C, which controls the growth rate of the curve, showed a strong correlation with the capacity factor, such that the regression equation for the capacity factor could be derived as a function of the three Gompertz model coefficients with a fitness of $R^2$ = 0.88.


Introduction
Global warming is projected to reach 1.5 °C between 2030 and 2052. Therefore, public health, human security, and economic growth are expected to face a major threat at the global level in the relatively near future [1]. In South Korea, greenhouse gas emissions from the energy sector, which accounted for 87% of total emissions in 2014, declined by 1.2% compared with the previous year due to the implementation of an energy transition strategy in which fossil fuel power plants are being replaced by renewable energy [2,3].
The implementation of the energy transition plan "Renewable Energy 3020", which was announced in 2017, will increase the renewable energy share of the energy mix from its current level of 7% to 20% by 2030, thus providing a new power capacity of 48.7 GW. "Renewable Energy 3020" has become the foremost driver toward the installation of photovoltaic (PV) systems [4]. As a result, South Korea has been ranked among the top 10 PV markets in terms of cumulative capacity, having reached 7.2 GW in 2018, and there are plans to install a further 30.8 GW by 2030 [5].
Once renewable power generation accounts for over 20% of electricity generation, the stability of the electricity grid will be an important issue due to the volatility of solar and wind energy. In order to resolve this problem, renewable energy forecasting has been proposed as a major solution for stable management of the electricity grid [6]. Especially for long-term forecasting, solar irradiation should first be predicted based on either numerical weather prediction or satellite imagery, and PV power output should then be estimated by a performance curve of a PV system using solar irradiance as the input [7].
Recently, the prediction of solar irradiance from satellite imagery using radiative transfer models has been adopted [8]. The University of Arizona Solar Irradiance Based on Satellite/Korea Institute of Energy Research (UASIBS-KIER) model of Kim et al. [9] reliably produces down-welling surface shortwave radiation every 15 min at a spatial resolution of 1 km² by employing the geostationary weather satellite COMS (Communication, Ocean, and Meteorological Satellite), which was launched in 2010 and scans the Oceanian hemisphere.
Not only PV power forecasting for peak load control, but also evaluations of PV potential for the establishment and management of the national supply target, require a simple yet reliable performance curve of PV systems as a function of solar irradiance [10]. Therefore, the primary purpose of this study was to derive a generalized performance curve of PV systems by correlating the PV power output of the nation's PV power plants and the satellite imagery-based solar irradiance at their locations all over the country. Sharma et al. (2010) [11] and Zhanga et al. (2018) [12] applied the linear regression of solar irradiance and PV power output to develop a PV power forecasting model, which was also the purpose of the present study. Gan et al. (2015) [13] and Field et al. (2015) [14] fitted PV power and solar irradiance with fourth- and fifth-order polynomial equations, respectively, which was problematic because a high-order polynomial can take an unexpected curved shape outside the range of the fitted values.
This paper introduced the Gompertz model, a sigmoidal equation commonly used in growth analyses, as an adequate regression model for determining the PV performance curve, i.e., PV power output with respect to global horizontal irradiance (GHI). In order to correct an inconsistency between the Gompertz model and the real PV power output in low GHI ranges, a linear equation was conjoined. The regression model parameters were determined using the hourly PV power output data of 232 sites across the country for a three-year period, and the corresponding hourly GHI was predicted using the UASIBS-KIER model.

Solar Irradiance Data
The UASIBS-KIER model estimates the GHI based on the visible reflectance and infrared brightness temperature taken from COMS satellite imagery over the Korean Peninsula. The GHI is usually derived from the look-up table pre-generated by the Goddard Space Flight Center Radiative Transfer Model [15] for discrete values of the solar zenith angle, surface albedo, ozone, water vapor, aerosol/cloud optical depths, and so forth. The accuracy of the UASIBS-KIER model was validated against 35 ground observation stations, yielding root mean square errors of 9.1% and 15.5% for clear and cloudy skies, respectively [16].
The plane-of-array irradiance (POA), rather than the GHI, is the solar irradiance component that contributes directly to the PV module's performance. In order to evaluate the POA, the installation layout of the PV panel (such as the installation angle and direction) must be known, and the direct normal irradiance (DNI) must also be calculated by the direct-diffuse irradiance decomposition of the GHI. Because this work has yet to be completed, the verified GHI data were employed first.

PV Power Generation Data
In 2018, the number of PV power plants in South Korea was about 35,000, from which the power generation data of 600 PV power plants (N = 600) were selected. These plants were grid connected, registered with the Korea Power Exchange (KPX), had a capacity of over 50 kW, and were in operation during the period January 2014 to December 2016.
The stratified random sampling method was used to calculate the appropriate sample size for each province. The sample sizes totaled 242 after rounding up, so that the study sample was larger than 40% of the 600 selected PV power plants (Table 1). Almost 50% of the PV power plants were located in Jeollanam-do and Gyeongsangbuk-do (Figure 1). Ten randomly selected PV power plants were reserved for the validation test of the statistical regression, meaning that the dataset of these 10 sites was not used for creating the model.
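The allocation step of stratified sampling can be illustrated with a minimal sketch. It assumes proportional allocation with each stratum rounded up, which the text does not state explicitly. The function name `proportional_allocation` and the province counts are hypothetical placeholders, not the values in Table 1.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Allocate a sample across strata in proportion to stratum size, rounding up."""
    population = sum(stratum_sizes.values())
    return {
        # Integer ceiling division avoids floating-point rounding surprises.
        name: -(-total_sample * size // population)
        for name, size in stratum_sizes.items()
    }


# Hypothetical plant counts per province (placeholders, not the counts in Table 1).
plants_per_province = {"province_a": 300, "province_b": 200, "province_c": 100}
allocation = proportional_allocation(plants_per_province, total_sample=235)
total_allocated = sum(allocation.values())
```

Because each stratum is rounded up, the allocated total can exceed the target sample size slightly, which is consistent with the remark that the sample reached 242 once the numbers were rounded up.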
According to Jordan and Kurtz (2013) [18], the median performance degradation due to the aging of the PV module was estimated to be −0.5%p per year. The mean operating age of the 242 PV power plants used in this study was 4.6 years, so the estimated overall degradation was −2.3%p. However, only three years of PV power output data were used in this study, which was insufficient to enable a statistical analysis of the aging trend. Moreover, it was difficult to identify the aging effect before excluding the other effects caused by various environmental conditions. For these reasons, this study did not consider the aging effect.

Gompertz Function
Figure 2 shows a typical PV module performance with respect to solar irradiation. Because of the nonlinearity of the PV module performance [19], a linear regression, depicted as a dashed line, was unsuitable for expressing the performance curve of the PV module. The Gompertz function is a sigmoid curve, which describes asymptotic growth as being the slowest at the end of a given time period or at the maximum of a given variable. With regard to the solar energy field, a symmetrized and shifted Gompertz function was applied to model the I-V curve of the PV module [20].
Therefore, the performance curve of the PV system was expressed as a Gompertz function of the GHI, $x$ (W/m²), by the following equation:
$$y = A\,e^{-e^{B - Cx}}$$
where $y = P/P_N$ is the normalized PV power output, A is the asymptote, B is the x-axis displacement, and C is the growth rate. First, the hourly PV power output $P$ (kW) at each power plant was normalized by dividing it by its nominal capacity $P_N$ (kW). Second, a regression analysis was conducted to find the best fit between the hourly GHI and the normalized PV power output by employing the Gompertz function. Figure 3 shows the effects of the coefficients: A moved the asymptotic maximum up and down, B shifted the curve left and right, and C changed the growth rate. The sigmoid shape was asymmetrical about its inflection point and converged to a maximum value, which was ideally $P_N$, as the PV power output generally leveled off toward the nominal capacity.
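The normalization and regression steps can be sketched as follows. The text does not state which fitting algorithm was used, so this sketch substitutes a simple stand-in: for a fixed asymptote A, the relation ln(−ln(y/A)) = B − C·x is linear in x, so B and C follow from ordinary least squares, and A is chosen from a grid by the smallest squared error. The function names and the synthetic, noise-free data are assumptions made for illustration only.

```python
import math


def gompertz(x, A, B, C):
    """Normalized PV output y = A * exp(-exp(B - C*x)) for a GHI of x (W/m^2)."""
    return A * math.exp(-math.exp(B - C * x))


def fit_bc_given_a(xs, ys, A):
    """Least-squares estimate of B and C for a fixed asymptote A.

    Uses the linearization ln(-ln(y/A)) = B - C*x, valid only for 0 < y < A,
    so points outside that range are skipped.
    """
    pts = [(x, math.log(-math.log(y / A))) for x, y in zip(xs, ys) if 0.0 < y < A]
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    mz = sum(z for _, z in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxz = sum((x - mx) * (z - mz) for x, z in pts)
    slope = sxz / sxx
    return mz - slope * mx, -slope  # (B, C)


def fit_gompertz(xs, ys, a_grid):
    """Scan candidate asymptotes and keep the (A, B, C) with the smallest squared error."""
    best = None
    for A in a_grid:
        B, C = fit_bc_given_a(xs, ys, A)
        sse = sum((gompertz(x, A, B, C) - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, A, B, C)
    return best[1:]


# Synthetic demonstration data generated from known coefficients (not plant data).
true_coeffs = (0.77, 1.10, 0.004)
ghi = list(range(50, 1001, 50))                      # W/m^2
y_norm = [gompertz(x, *true_coeffs) for x in ghi]    # stands in for P / P_N
a_grid = [0.60 + 0.01 * k for k in range(41)]        # candidate asymptotes 0.60 .. 1.00
A_fit, B_fit, C_fit = fit_gompertz(ghi, y_norm, a_grid)
```

With measured, noisy data a nonlinear least-squares routine would normally be preferred, since the linearization weights low-output hours heavily. Note also that `gompertz(0, ...)` is small but not zero, which is the behavior at the origin that the linear segment is introduced to correct.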

Linear-Gompertz Conjoint Function
The Gompertz curve poorly represents the correlation between GHI and P near the origin, because y converges to $A\,e^{-e^{B}}$ rather than to zero as x approaches zero. This mathematical behavior violates the laws of physics, i.e., the PV power output should be zero when there is no solar irradiance.
In order to prevent this unrealistic behavior of the regression model, a linear equation with slope D and a y-intercept of zero was joined to the Gompertz function near the origin [21].
The conjoint curve was expressed as follows:
$$y = \begin{cases} D\,x, & 0 \le x < x_j \\ A\,e^{-e^{B - Cx}}, & x \ge x_j \end{cases} \tag{4}$$
For a smooth connection of the two curves, it was necessary to impose the following conjoint conditions, i.e., equal values and equal gradients at the connection point:
$$D\,x_j = A\,e^{-e^{B - Cx_j}} \tag{5}$$
$$D = A\,C\,e^{B - Cx_j}\,e^{-e^{B - Cx_j}} \tag{6}$$
where $(x_j, y_j)$ is the connection point of the two functions. By substituting Equation (6) into Equation (5), the following equation was obtained:
$$C\,x_j\,e^{B - Cx_j} = 1 \tag{7}$$
Equation (7) can be solved analytically by introducing the Lambert W-function [22], and the following solution was obtained:
$$x_j = -\frac{1}{C}\,W\!\left(-e^{-B}\right) \tag{8}$$
where $W(z)$ is the Lambert W-function, the inverse function of $f(v) = v\,e^{v}$. There were multiple solutions for $x_j$, but the smallest $x_j$, which corresponds to the principal branch of $W$, was chosen for the present study. Figure 4 demonstrates the conjoint case when (A, B, C) = (0.77, 1.10, 0.004). Once the Gompertz coefficients were derived by regression analysis, it was possible to calculate $x_j$ by Equation (8), D by Equation (6), and finally, $y_j$ by Equation (4).
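The calculation of the connection point can be sketched in a few lines. This is a minimal illustration, assuming the Gompertz form y = A·exp(−exp(B − C·x)) implied by the limit at the origin, and using a hand-written Newton iteration for the principal branch of the Lambert W-function so that the example has no dependencies. The function names are hypothetical.

```python
import math


def lambert_w0(z, tol=1e-14, max_iter=100):
    """Principal branch W0(z) for -1/e < z <= 0, by Newton's method on w*exp(w) = z."""
    if not (-1.0 / math.e < z <= 0.0):
        raise ValueError("this sketch only handles -1/e < z <= 0")
    w = 0.0  # starting to the right of the root keeps the iterates in (-1, 0]
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def gompertz(x, A, B, C):
    return A * math.exp(-math.exp(B - C * x))


def conjoint_point(A, B, C):
    """Return (x_j, y_j, D) for the smallest connection point; requires B > 1."""
    x_j = -lambert_w0(-math.exp(-B)) / C
    # Slope of the Gompertz curve at x_j, used as the slope of the linear segment.
    u = math.exp(B - C * x_j)
    D = A * C * u * math.exp(-u)
    y_j = D * x_j  # value of the linear segment at the connection point
    return x_j, y_j, D


def linear_gompertz(x, A, B, C):
    """Linear segment through the origin below x_j, Gompertz curve above it."""
    x_j, _, D = conjoint_point(A, B, C)
    return D * x if x < x_j else gompertz(x, A, B, C)


# Coefficients of the demonstration case quoted in the text.
x_j, y_j, D = conjoint_point(0.77, 1.10, 0.004)
```

Computing D from the gradient and then checking that D·x_j equals the Gompertz value at x_j is a convenient sanity check that both conjoint conditions hold. A real solution on this branch exists only when B > 1, since the argument −exp(−B) must stay above −1/e.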

Evaluation of Regression
The fitness of the regression model was evaluated with the normalized root mean square error (nRMSE) with respect to the nominal capacity $P_N$, which was defined as:
$$\mathrm{nRMSE} = \frac{1}{P_N}\sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(P_{L\text{-}G,t} - P_{KPX,t}\right)^{2}} \tag{9}$$
where T is the total number of power generation hours (P > 0), and $P_{L\text{-}G}$ and $P_{KPX}$ stand for the PV power predicted by the Linear-Gompertz (L-G) regression model and the real PV power output data provided by KPX, respectively. The coefficient of determination ($R^2$) was also evaluated for the linear regression (as the reference) and the Linear-Gompertz regression for comparison. The evaluation of regression was carried out by year, by season, and by province to confirm whether the Linear-Gompertz model was suitable as the PV performance curve. The final validation was performed for the 10 sites, which were not used to derive the Gompertz model coefficients.
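The two fitness measures can be computed as in the following sketch. It assumes that "power generation hours" means hours in which the measured output is positive, and it applies the same filter to both measures. The helper names and the four-hour sample series are hypothetical.

```python
import math


def generating_hours(p_pred, p_obs):
    """Keep only the hours in which the measured output is positive (P > 0)."""
    pairs = [(p, o) for p, o in zip(p_pred, p_obs) if o > 0]
    return [p for p, _ in pairs], [o for _, o in pairs]


def nrmse(p_pred, p_obs, p_nominal):
    """Root mean square error of the predicted power, normalized by nominal capacity."""
    mse = sum((p - o) ** 2 for p, o in zip(p_pred, p_obs)) / len(p_obs)
    return math.sqrt(mse) / p_nominal


def r_squared(p_pred, p_obs):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(p_obs) / len(p_obs)
    ss_res = sum((o - p) ** 2 for p, o in zip(p_pred, p_obs))
    ss_tot = sum((o - mean_obs) ** 2 for o in p_obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical hourly series in kW for a 100 kW plant (illustration only).
measured = [0.0, 12.0, 18.0, 30.0]
predicted = [1.0, 10.0, 20.0, 30.0]
pred_gen, meas_gen = generating_hours(predicted, measured)
site_nrmse = nrmse(pred_gen, meas_gen, p_nominal=100.0)
site_r2 = r_squared(pred_gen, meas_gen)
```

Normalizing by the nominal capacity rather than by the mean output makes the error comparable across plants of different sizes.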

Comparison of the Regression Models
Regression fitness, as measured by $R^2$ and nRMSE, was evaluated for the 232 PV power plants, excluding the 10 sites which were reserved for validation. Table 2 compares the mean (µ) and the standard deviation (σ) of the fitness measures between the Linear (L) model and the Linear-Gompertz model by year. In terms of the overall values of the measures, the fitness of L-G was better than that of L, as $R^2$ was higher and nRMSE was lower for the L-G regression. For statistical confirmation, t-tests with the null hypothesis of equal means (Equation (10)) were performed:
$$\mu_{L\text{-}G} - \mu_{L} = 0 \tag{10}$$
The null hypothesis was rejected (p-values < 0.05), which confirms that the means of L and L-G were statistically different. In other words, compared with L, the $R^2$ of L-G was higher by about 2.5% and the nRMSE of L-G was lower by about 11%.
It was difficult to determine whether the L-G model substantially outperformed the L model because the solar irradiance in South Korea was not strong (<1300 W/m²) and most of the GHI and PV power output data were concentrated in the linear section (<1000 W/m²) of the L-G regression model. However, the comparison results obtained after selecting 10 PV power plants with high solar irradiance in excess of 1000 W/m² verified that the $R^2$ improvement in the L-G model was 6%, which is well above the overall improvement.
Figure 5 shows the spatial distribution of $R^2$ and nRMSE of the Gompertz model over South Korea, in which no obvious spatial pattern can be observed, i.e., the pattern was random. For reference, the correlation coefficient between $R^2$ and nRMSE was −0.86.
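The test of equal means between the two models can be sketched as below. The text does not specify the variant of the t-test, so this example assumes a two-sample (Welch) statistic and, because the samples are large (232 sites), approximates the p-value with the standard normal distribution instead of the t distribution. The per-site $R^2$ values are synthetic placeholders.

```python
import math
from statistics import NormalDist, mean, variance


def welch_t_test(sample_a, sample_b):
    """Two-sample t statistic with a large-sample (normal) two-sided p-value."""
    n_a, n_b = len(sample_a), len(sample_b)
    se = math.sqrt(variance(sample_a) / n_a + variance(sample_b) / n_b)
    t = (mean(sample_a) - mean(sample_b)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Synthetic per-site R^2 values for the two models (illustration only).
r2_linear = [0.80 + 0.002 * (i % 10) for i in range(232)]
r2_linear_gompertz = [0.82 + 0.002 * (i % 10) for i in range(232)]

t_stat, p_value = welch_t_test(r2_linear_gompertz, r2_linear)
reject_equal_means = p_value < 0.05
```

Since both models are fitted to the same sites, a paired test on the per-site differences would be an equally reasonable choice and would typically be more sensitive.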

Gompertz Model Coefficients
Before proceeding with the statistical analysis, the Anderson-Darling normality test [23], which is an effective normality test for small samples, was carried out for A, B, and C. The results of the normality test confirmed that all of the Gompertz coefficients had a normal distribution.
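For readers unfamiliar with the test, the Anderson-Darling statistic for normality can be computed as in this minimal sketch. It assumes the common case in which the mean and standard deviation are estimated from the sample, uses the small-sample adjustment factor (1 + 0.75/n + 2.25/n²), and compares against the widely tabulated 5% critical value of 0.752 for that adjusted statistic. The two samples are synthetic and are not the fitted coefficients.

```python
import math
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(sample):
    """Adjusted Anderson-Darling statistic for normality with estimated mean and std."""
    n = len(sample)
    mu, sigma = mean(sample), stdev(sample)
    std_normal = NormalDist()
    cdf = [std_normal.cdf((x - mu) / sigma) for x in sorted(sample)]
    s = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
        for i in range(n)
    )
    a2 = -n - s / n
    return a2 * (1.0 + 0.75 / n + 2.25 / n ** 2)


CRITICAL_5_PERCENT = 0.752  # for the adjusted statistic, parameters estimated

n_sites = 232
# Synthetic samples: evenly spaced normal quantiles and exponential quantiles.
normal_like = [NormalDist(1.10, 0.07).inv_cdf((i + 0.5) / n_sites) for i in range(n_sites)]
skewed = [-math.log(1.0 - (i + 0.5) / n_sites) for i in range(n_sites)]

normal_stat = anderson_darling_normal(normal_like)
skewed_stat = anderson_darling_normal(skewed)
```

A statistic below the critical value means that normality is not rejected at the 5% level, which is the outcome reported for all three coefficients.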
The means and standard deviations of the Gompertz coefficients derived from the regression of 232 sites are summarized in Table 3. As Table 3 shows, the means of A and B varied slightly from year to year, but the standard deviations of all the coefficients did not vary according to the equal variance F-test. Moreover, the standard deviations of A and B were within 6.5% of their means, while C showed a wider variation of 15%. Given that the Gompertz coefficients had normal distributions, equal variances, and small standard deviations, it is possible to conclude that the Gompertz coefficients represent the whole country's PV system performance characteristics.
Figure 6 shows a comparison of the Gompertz curves by province. The standard deviations of these curves with respect to the generalized Gompertz curve, whose coefficients are given in Table 3, were within 5% across the whole GHI range.
Figure 7 shows the relationship between the Gompertz coefficient C and the capacity factor (CF) of the 232 sites. The capacity factor and the Gompertz coefficient C had a strong correlation of $R^2$ = 0.53, while the Gompertz coefficients A and B showed no correlation with the capacity factor ($R^2$ < 0.02). This can be interpreted to mean that the Gompertz coefficient C controlled the variance of the curve at a given solar irradiance depending on such environmental factors as air temperature. Therefore, it was necessary to analyze the Gompertz coefficient by season in order to identify the effects of air temperature. Using the strong correlation between C and CF, it was possible to derive an equation for CF as a function of the Gompertz coefficients through a multivariate regression (Equation (11)). For reference, the regression fitness of Equation (11) was $R^2$ = 0.88, which was higher than that of the regression between C and CF alone, because it accommodates the interrelationship among A, B, and C.
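A multivariate regression of the capacity factor on the three coefficients can be sketched as follows. The functional form of Equation (11) is not recoverable from the text, so a linear form CF = b0 + b1·A + b2·B + b3·C is assumed here purely for illustration. The coefficient values and the per-site data are synthetic and hypothetical, and are not the paper's results.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def fit_linear_model(predictors, response):
    """Ordinary least squares with an intercept, solved on mean-centered predictors."""
    n, k = len(response), len(predictors[0])
    means = [sum(row[j] for row in predictors) / n for j in range(k)]
    y_mean = sum(response) / n
    centered = [[row[j] - means[j] for j in range(k)] for row in predictors]
    sxx = [[sum(r[a] * r[b] for r in centered) for b in range(k)] for a in range(k)]
    sxy = [sum(r[a] * (y - y_mean) for r, y in zip(centered, response)) for a in range(k)]
    slopes = solve_linear_system(sxx, sxy)
    intercept = y_mean - sum(s * m for s, m in zip(slopes, means))
    return intercept, slopes


# Hypothetical coefficients of an assumed linear model (not Equation (11)).
b0, b = 0.02, [0.10, -0.03, 20.0]
# Synthetic per-site (A, B, C) triples.
site_coeffs = [
    (0.70 + 0.01 * (i % 7), 1.00 + 0.02 * (i % 5), 0.0035 + 0.0001 * (i % 11))
    for i in range(232)
]
capacity_factor = [b0 + sum(w * v for w, v in zip(b, row)) for row in site_coeffs]

intercept, slopes = fit_linear_model(site_coeffs, capacity_factor)
```

Centering the predictors before solving keeps the system well conditioned even though C is several orders of magnitude smaller than A and B.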
The variations of the Gompertz coefficients by season (as listed in Table 4) were evaluated in order to ascertain whether the ambient temperature had a major impact on the plants' performance. It is well known that the PV module's efficiency decreases when its temperature rises above a certain threshold value, adversely affecting summertime power production [24]. The seasonal variations of the Gompertz coefficients A, B, and C with respect to their annual means were 16%, 20%, and 32%, respectively (Table 4). This implies that the Gompertz coefficient C (or capacity factor) had obvious seasonality due to seasonal changes in environmental parameters such as the ambient temperature. In terms of the capacity factor, the seasons ranked from highest to lowest as winter, autumn, spring, and summer. The fitness of the seasonal Gompertz model increased to $R^2$ = 0.88.
Table 5 summarizes the results of the validation of the generalized Linear-Gompertz model given in Equation (4) with the mean values given in Table 3. The fitness measures showed the same level of $R^2$ and nRMSE as those in Table 2. These results mean that if the PV power generation were estimated using the L-G model throughout South Korea, the expected nRMSE would be 0.09, i.e., 9% of $P_N$. This error includes not only the error of the regression model, but also that of the input GHI data. According to a study conducted by Kim et al. (2017), the rRMSE of the GHI prediction of the UASIBS-KIER model ranged from 7.4% to 16.7% [16]. This level of error is acceptable for the purpose of evaluating the PV potential of South Korea. However, considering that the average error of general one-hour-ahead forecasting models is nRMSE = 7.2% [25], the error in the results of the present study is somewhat large for a PV power forecasting model.

Conclusions
The following conclusions may be drawn based on the statistical analysis of the hourly correlation between the power output of 242 PV power plants (out of a total of 35,000 power plants) in South Korea in 2018, and the weather satellite-derived solar irradiation on their installation locations.
The main results of this study are summarized as follows:
(1) The Linear-Gompertz model successfully expressed the sigmoidal characteristics of the PV system performance countrywide as a single function of GHI, which is the simplest regression form suitable for the machine learning needed to develop a forecasting model. The nonphysical trend of the Gompertz model in the low GHI range was fixed by combining a linear equation having the same slope at the conjoint point. The fitness of the Linear-Gompertz regression was $R^2$ = 0.85, and the nRMSE of the normalized power output ratio was 0.09.
(2) The three Gompertz coefficients A, B, and C were calculated by year, by season, and by province, and it was found that they had normal distributions and equal variances, meaning that the Gompertz coefficients were general parameters for the entire country. Moreover, the growth rate coefficient C showed a strong correlation ($R^2$ = 0.53) with the capacity factor of the PV power plant. Therefore, it was possible to derive the capacity factor equation as a function of A, B, and C, which showed a fitness of $R^2$ = 0.88.
(3) In order to use the Linear-Gompertz model to obtain South Korea's general PV performance curve for PV power forecasting, it will be necessary to increase the fitness of the model to above $R^2$ = 0.9 by including significant environmental variables such as ambient temperature. Future research will consist of securing long-term PV power output data and analyzing the aging effect of the PV panel to correct the degradation effect. In addition, the accuracy of the Linear-Gompertz model will be improved by calculating and applying the POA, the primary input variable, instead of GHI. To that end, the solar irradiance decomposition algorithm should be improved in the UASIBS-KIER model, and verification and compensation steps using actual measurement data should be implemented in advance.
(4) Because the solar irradiance in most regions of South Korea is less than 1300 W/m², an additional verification under conditions of high solar irradiance is needed before applying this result to regions with high solar irradiance. In addition, since PV power generation is significantly influenced by climate conditions, there will be some differences in regions whose climate zone is completely different from that of South Korea. However, South Korea has four distinct seasons and a wide temperature distribution ranging from −15 °C to +35 °C throughout the year. Thus, the effect of climate conditions on PV power generation is already relatively significant in the present data, which means that the prediction error incurred when the present regression model is applied to other climate zones is expected to be relatively small.