Analysis and Impact Evaluation of Missing Data Imputation in Day-ahead PV Generation Forecasting

Over the past decade, PV power plants have increasingly contributed to power generation. However, PV power generation varies widely with environmental factors; thus, accurate forecasting of PV generation is essential. Meanwhile, the weather data that describe these environmental factors contain many missing values; for example, in the precipitation data of the Korea Meteorological Administration, missing values amounted to ~16% for 2015–2016, and 19% of the weather data were missing for 2017. Such missing values degrade PV power generation prediction performance and must be replaced with substitute values. Here, we explore the impact of the missing data imputation methods that can be used to replace these missing values. We apply four missing data imputation methods to the training data and test data of a prediction model based on support vector regression. When the k-nearest neighbors (KNN) method is applied to the test data, the prediction results are closest to those obtained for the original data with no missing values, and the prediction model's performance remains stable even when the missing data rate increases. Therefore, we conclude that the KNN method is the most appropriate missing data imputation method for PV forecasting.


Introduction
With photovoltaic (PV) systems becoming a mature and environmentally friendly technology over the past few years, several countries have begun to invest aggressively in PV energy resources. For example, China has increased its PV budget by 30.7% relative to that of 2017. In addition to developed countries such as China, developing ones such as the Marshall Islands, Rwanda, and the Solomon Islands are also increasingly investing in renewable energy [1], thereby leading to increased PV generation over the past decade [2]. Figure 1 [1] depicts the increasing trend of PV power generation, which implies that PV systems are expected to spread globally.
However, PV generation depends on weather conditions such as temperature, relative humidity, precipitation, and solar radiation. In other words, the PV output fluctuates frequently because of weather factors. With more PV systems being integrated into grids, such fluctuating PV generation can severely impact the stability, reliability, quality, and operation of power systems while also reducing economic benefits. Therefore, accurate forecasting of PV generation becomes very important for grid operations involving distributed energy resources. As mentioned in [3], several studies have focused on the prediction of solar power generation and solar radiation in this context. Among the 15,700 solar irradiation and power prediction articles appearing in a Google Scholar search, 6340 were published in 2016.
However, there are still problems in accurately forecasting solar power generation because of inaccurate predictions arising from missing data, where "missing data" refers to completely absent data objects/series or imperfect data, i.e., partial data [4]. In this context, it has been reported [5] that when the missing data ratio in the total data is less than 1%, the impact is negligible. Further, missing rates between 1% and 5% correspond to manageable or flexible sample data. On the other hand, missing data rates >5% of the total data require suitable solutions [5], and missing data rates >15% significantly degrade the prediction model [6,7]. Thus far, PV power generation prediction studies have reported poorer performance on rainy days than on sunny days [8][9][10]. This implies that the existence of many missing precipitation data values significantly deteriorates the prediction accuracy. Hence, methods to address the missing data need to be urgently studied.
Several studies on missing data imputation have been conducted in multiple contexts. In [5][6][7]11,12], missing data imputation algorithms based on statistical methods and machine learning approaches such as the k-nearest neighbors (KNN) method have been proposed. Other studies [13][14][15][16] have proposed missing data imputation methods for solar irradiation. In [13], the authors proposed three missing data imputation methods that can replace missing solar radiation data: inverse-distance weighting (IDW), multiple linear regression (MLR), and multivariate imputation by chained equations (MICE). Among these, MICE affords outputs closest to the original solar radiation values when the missing values are replaced. In [14], the authors estimated the temperature and relative humidity for solar irradiance prediction using the Fourier series and support vector machines (SVMs), while in [15], the authors examined several missing weather data imputation methods ranging from simple ones such as mean imputation to complicated ones such as the multilayer perceptron and Markov chain Monte Carlo approaches. In [16], the authors devised two missing data imputation methods using atmospheric temperature and relative humidity: the decision matrix and the regression correlation of weather data. The first method affords a minimum correlation coefficient value of 0.95, RMSE of 87.6 W/m², NRMSE of 8.29%, and an index of agreement of 0.97 for irradiation. Here, NRMSE is the value obtained by dividing the RMSE value by the difference between the maximum and the minimum values of the measured irradiation data. On the other hand, the second method yields a higher error than the first method.
In addition to research on solar radiation, several research works have focused on meteorological factors [17][18][19][20]. In [17], the authors demonstrated the drawbacks of the IDW method and proposed three other methods: the coefficient correlation weighing method (CCWM), the artificial neural network estimation method (ANNEM), and the kriging estimation method (KEM), which outperformed conventional methods. Meanwhile, another study [18] has suggested the fixed functional set genetic algorithm method (FFSGAM), which utilizes genetic algorithms and nonlinear optimization methods to estimate the missing data. When compared with IDW, the FFSGAM yields greater prediction accuracy. Here, we note that only six rain-gauging stations in Korea are able to estimate missing data; thus, missing data imputation becomes extremely important in this context. The method proposed in [19] is similar to that in [17] in that both are based on hybrid models. The former approach utilizes an artificial neural network (ANN) and a regression tree (RT) and affords better prediction accuracy than the ANN or RT models alone. Meanwhile, in [20], the authors demonstrated 17 deterministic imputation methods for missing precipitation data and concluded that the most suitable method is multiple linear regression weighted by the square of the missing data ratio.
Several studies have demonstrated improved prediction performance via application of these methods to classification or prediction models [21,22]. In [21], the authors demonstrated a neural network model for breast cancer diagnosis with the use of three statistical methods and three machine learning methods for missing data. In [22], the authors demonstrated that conventional road traffic congestion prediction performance can be improved via application of a missing data imputation method based on machine learning techniques.
However, very few studies have applied missing weather data imputation to PV generation forecasts. Although the authors of [23] developed a PV forecasting model using missing data imputation methods, they only considered the imputation of missing PV power data, and not of missing weather data. Missing data imputation forms a significant data preprocessing issue in predicting solar power generation because precipitation data are generally inaccurate. For example, as per the Korea Meteorological Administration, the approximate missing data rate was 16% in 2015 and 2016 and 19% in 2017. In particular, it is difficult to estimate how rainfall must be considered because a large amount of precipitation data is missing. Moreover, very few studies have focused on how solar power generation forecasts change when a method for replacing missing points is applied.
Against this backdrop, we first generate randomly missing data in the weather data of the period 2016-2017 with three missing data ratios, corresponding to the missing completely at random (MCAR) process [24]. Second, we replace the missing data with suitable values by using four different approaches: linear interpolation (LI), mode imputation (MI), k-nearest neighbors (KNN), and multivariate imputation by chained equations (MICE). Third, for the four cases that utilize the missing data imputation methods and the one that does not, we construct forecasting models using the SVR method with the solar power generation data for 2016. Finally, we predict solar PV generation for 2017 with the forecasting model and the missing value-corrected weather data of 2017. Further, we calculate and compare the forecast errors for specific periods in 2017, and we examine the accuracy resulting from the application of the missing data imputation methods.
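The MCAR step can be illustrated with a minimal sketch. It assumes that the weather data are held as a list of rows and that every cell is equally likely to be removed; the function name `mask_mcar`, the toy table, and the choice to mask all columns are illustrative assumptions rather than details taken from this study.

```python
import random

def mask_mcar(rows, rate, seed=0):
    """Return a copy of `rows` with a fraction `rate` of cells set to None.

    Cells are drawn uniformly at random and independently of their values,
    which is the defining property of missing completely at random (MCAR).
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_missing = round(rate * len(cells))
    masked = [list(r) for r in rows]          # leave the original untouched
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked

# Hypothetical toy table: 100 hourly rows of (temperature, humidity, precipitation).
weather = [[20.0 + h % 24, 60.0, 0.0] for h in range(100)]
counts = {}
for rate in (0.10, 0.15, 0.20):               # the three missing data ratios
    masked = mask_mcar(weather, rate)
    counts[rate] = sum(v is None for row in masked for v in row)
```

Changing `seed` yields a different set of missing positions at the same rate, which is why results are typically averaged over several repetitions.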
The rest of the paper is organized as follows: Section 2 describes the meteorological and historical PV power generation data used in predicting solar power generation along with the four missing data imputation methods. Section 3 describes the error evaluation of the missing data imputation methods applied to the model. In Section 4, we calculate the error of the missing data imputation for each method and present the results. Finally, Section 5 concludes the paper.

Data and Prediction Model
The solar power generation data are based on the PV dataset provided by the Korea Southern Power Co., Ltd. (KOSPO). The dataset can be downloaded in file formats such as CSV and JSON from the Korea Open Data Portal [25]. The time step of the data is 1 h, and the data cover the period from 1 January 2015 to 19 September 2017. In this study, the data from 1 January to 31 December 2016 are used as the training data for the PV power prediction model, and the data from 1 January to 19 September 2017 are used as test data. The solar power plant of interest is located at 55, Hwasunhaean-ro 106 beon-gil, Andeok-myeon, Seogwipo-si, Jeju-do, Republic of Korea (around latitude 33.24° and longitude 126.34°). The peak output PV power of the plant is 196 kW.
The hourly meteorological data can be downloaded from the open weather data portal provided by the Korea Meteorological Administration [26]. In this study, we used the local live forecast data of Jeju Island. The weather conditions included temperature, relative humidity, wind speed, wind direction, precipitation, and sky condition. Here, the sky condition expresses the amount of cloud cover as a numerical value in the range from 1 to 4, where 1 indicates the least cloud cover and 4 the highest cloud cover. As in the case of the 1-h time-step power generation data, the 1-h time-step weather data of 2016 were used to train the prediction model, and the 1-h time-step weather data for 2017 were used for the tests. Time and month values were used in the model to capture the time-of-day and seasonal characteristics not reflected by the weather variables.
Here, we mention that the weather data factor most closely related to solar power generation is solar irradiance. However, meteorological data portals do not provide predicted solar irradiance data. Therefore, we used solar irradiance data computed with the PVLIB Python library [27]. The library was utilized as follows: First, we defined the time zone of the solar radiation data to be obtained. We next calculated the zenith and azimuth angles over time by inputting the time data, together with the time zone, into the get_solarposition function of PVLIB. Next, we used the get_clearsky function of PVLIB to calculate solar radiation data such as the global horizontal irradiation (GHI), direct normal irradiation (DNI), and diffuse horizontal irradiation (DHI). A detailed description can be found in [28]. We note here that it is possible to compute the hourly solar irradiation at the PV panel location on sunny days by combining the three solar irradiation datasets with the Sun's zenith angle and azimuth, the latitude, longitude, and time zone, and the tilt and azimuth of the solar panel surface. The result corresponds to the plane-of-array irradiance components on a tilted surface. This value does not reflect weather conditions. Even though weather is not considered in this solar radiation value, we expect that generation remains predictable even on cloudy days, because the prediction model is trained by using the predicted precipitation and cloud cover data.
The relationship between the input and output data is shown in Figure 2. The meteorological data and the power generation data are 1-h time-step data, and the meteorological data predicted at 23:00 on the previous day are used by the PV generation prediction model for the next day. The PV forecast is made on day N, and the forecasting model is used to calculate the hourly power generation of day N + 1. The forecast horizon depends on the hour being forecast, and it ranges from 1 h to a maximum of 25 h. For example, when the amount of power generated at 01:00 is predicted, the forecast is made at 23:00 on the day before, i.e., two hours ahead. In contrast, to estimate the power generation at 22:00, the forecast is also made at 23:00 on the previous day, i.e., 23 h ahead. Here, we assume that the prediction of the next day's power is made simultaneously with the release of the forecast data.
Our prediction model was trained with 9 input features and 1 output corresponding to the year 2016, and it was tested with 9 input features and 1 output corresponding to 2017. As shown in Figure 2, the 9 inputs are: month, hour, temperature (°C), relative humidity (%), wind speed (m/s), wind direction (°), solar irradiance (W/m²), precipitation (mm), and cloud cover. In the study, missing data imputation was performed for all data with missing values, and PV generation was predicted by means of SVR. The detailed theory and SVR optimization process are presented in [10]. In addition, we used the Regression Learner app in MATLAB R2018a to implement and test the model. In the Regression Learner app, the medium Gaussian SVM was selected as it affords the highest accuracy among the SVR models. The medium Gaussian SVM algorithm was executed on a computer with an Intel® Core™ i7-7700 CPU clocked at 3.60 GHz with 16.0 GB of RAM.
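The horizon arithmetic can be checked with a short sketch. It assumes a single forecast release at 23:00 on day N and hourly targets from 00:00 to 24:00 on day N + 1; the helper name `horizon_hours` and the example date are illustrative only.

```python
from datetime import datetime, timedelta

def horizon_hours(issue_time, target_time):
    """Whole hours between the forecast release and the target hour."""
    return int((target_time - issue_time) / timedelta(hours=1))

issue = datetime(2017, 1, 19, 23)        # forecast released at 23:00 on day N
day_start = issue + timedelta(hours=1)   # 00:00 on day N + 1

# Map each target hour of day N + 1 (0 = 00:00, ..., 24 = 24:00) to its horizon.
horizons = {h: horizon_hours(issue, day_start + timedelta(hours=h))
            for h in range(25)}
```

Under these assumptions the horizon is always the target hour plus one, which reproduces the 2-h example for 01:00 and the stated range of 1 h to 25 h.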
Figure 3 depicts our power prediction model. In Figure 3, LI, MI, KNN, and MICE denote the linear interpolation, mode imputation, k-nearest neighbors, and multivariate imputation by chained equations algorithms, respectively. A more detailed description of the four methodologies is provided in Section 2.2.

LI
LI is a method that fills in intermediate missing data by using the data at both ends of the gap in the data series. For example, let us consider the sequence of data samples shown in Figure 4, $D_n(T) = D(nT)$, obtained by uniform sampling of the function $D(T)$ with step $T$.
In Figure 5, it is assumed that there is a linear relationship between data $D_m(T)$ and data $D_n(T)$.

Figure 5. Datasets with missing data.
Thus, $D_n(T)$ can be calculated by use of Equation (1). Likewise, other missing data can be filled in by LI. This method is stable when the missing data are sparse and isolated. Figure 5 depicts the structure of a hypothetical dataset with missing values. If the magnitude of n in a dataset is small, that is, if only a small amount of data is missing, LI is easy, fast, and accurate, and it provides smooth results. LI is comparatively sound in this case because the data are not likely to fluctuate rapidly within a short time frame. However, when the missing data interval lengths increase, the imputation accuracy degrades severely [22].
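As an illustration of LI, the sketch below fills each gap by drawing a straight line between the nearest known samples on either side. The function name `linear_interpolate` and the convention of marking missing samples with `None` are assumptions made for this example; gaps at the very start or end of the series have no second endpoint and are left unfilled.

```python
def linear_interpolate(series):
    """Fill None gaps that are bounded by known samples on both sides."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    # Walk over consecutive pairs of known samples (a, b) and fill between them.
    for a, b in zip(known, known[1:]):
        for n in range(a + 1, b):
            frac = (n - a) / (b - a)      # relative position inside the gap
            filled[n] = series[a] + frac * (series[b] - series[a])
    return filled

# Hypothetical hourly temperatures with a two-sample gap.
temps = [10.0, None, None, 16.0, 15.0]
temps_filled = linear_interpolate(temps)
```

The longer the gap, the more the straight line can depart from the true profile, which is consistent with the degradation noted above for long missing intervals.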

MI
MI is a method wherein missing data are replaced with the most frequently observed value. It is one of the simplest missing data imputation approaches, wherein the mode serves as the measure of central tendency and reflects the basic distribution of the data in the dataset. Thus, we can select the most frequent value to fill in the missing data, which makes the method simple and easy to apply. However, this method is unsuitable when there is a complicated relation between the data in the dataset [11]. In addition, daily solar irradiance data follow a specific pattern based on the position of the Sun in the sky, which cannot be restored by a simple method such as MI. Nevertheless, we select this method in order to show that when the missing data are processed in an undesirable way, the prediction deviates drastically.
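A minimal sketch of MI for a single variable is given below. The function name `mode_impute` and the use of `None` for missing entries are assumptions of the example; when several values are equally frequent, this version keeps the one that appears first in the column.

```python
from collections import Counter

def mode_impute(column):
    """Replace every None in `column` with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in column]

# Hypothetical sky condition codes (1 = least cloud cover, 4 = most).
sky = [1, 1, None, 3, 1, None, 4]
sky_filled = mode_impute(sky)
```

Because every gap receives the same value regardless of the hour, the daily shape of a variable such as irradiance is lost, which is the weakness described above.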

Imputation Using KNN
KNN imputation is based on a machine learning technique that has been extensively used for classification, regression, and imputation [29]. The method considers the k most similar instances in the incomplete dataset [29][30][31][32]. In other words, these k-nearest neighbors aid in replacing missing data with estimates. Given an incomplete data pattern x, the method selects the k closest instances that have complete values in the features to be imputed (i.e., the attributes with missing values in x), such that a given distance measure is minimized [29]. In our study, $\nu = \{v_k\}_{k=1}^{K}$ represents the set of KNN of x arranged in ascending order of distance. Although the KNN can be chosen from among objects without missing values, it is also encouraged to substitute corrected data for missing data [30,32].
To process missing values with the use of KNN, let us suppose that x includes a missing value on the j-th input feature. After its KNN are selected, we set $\nu = \{v_k\}_{k=1}^{K}$, and the unknown value is estimated using the corresponding j-th feature values of $\nu$. If the j-th input feature is a numerical or continuous variable, different estimation procedures can be implemented in the KNN approach.
We weigh the contribution of each $v_k$ according to its distance to x, i.e., $d(x, v_k)$, assigning higher weights to closer neighbors. The estimate is given by

$$\hat{x}_j = \frac{\sum_{k=1}^{K} w_k v_{kj}}{\sum_{k=1}^{K} w_k} \quad (2)$$

where $v_{kj}$ is the j-th feature value of $v_k$ and $w_k$ represents the weight related to the k-th neighbor. A suitable option for $w_k$ corresponds to the inverse square of the distance, as described by Equation (3):

$$w_k = \frac{1}{d(x, v_k)^2} \quad (3)$$

In this study, we use this weighted method to estimate missing values in the cases of numerical and discrete variables, and further, we arbitrarily set $K = 3$. It has been shown that this method is robust for missing data estimation [29][30][31][32]. Its primary limitation is that whenever the method searches for the most similar instances, the algorithm has to scan the entire dataset. The method is therefore problematic for large databases because it is time consuming and computationally expensive.
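The weighted estimate can be sketched as follows, assuming a Euclidean distance computed over the features that are observed in x and a candidate pool restricted to complete rows. Both choices, the function name `knn_impute_value`, and the handling of zero distances are assumptions of this sketch; features are also left unscaled here, whereas in practice they would usually be normalized first.

```python
import math

def knn_impute_value(x, j, candidates, k=3):
    """Estimate missing feature j of pattern x from its k nearest neighbours."""
    # Features observed in x (other than j) are the ones used for the distance.
    obs = [c for c, v in enumerate(x) if v is not None and c != j]

    def dist(v):
        return math.sqrt(sum((x[c] - v[c]) ** 2 for c in obs))

    # Assumption: only rows without any missing value may serve as neighbours.
    complete = [v for v in candidates if all(u is not None for u in v)]
    neighbours = sorted(complete, key=dist)[:k]

    # A neighbour at zero distance would receive infinite weight, so exact
    # matches are averaged directly instead.
    exact = [v[j] for v in neighbours if dist(v) == 0.0]
    if exact:
        return sum(exact) / len(exact)

    weights = [1.0 / dist(v) ** 2 for v in neighbours]   # inverse-square weights
    return sum(w * v[j] for w, v in zip(weights, neighbours)) / sum(weights)

# Hypothetical two-feature rows; the second feature of x is missing.
pool = [[0.0, 10.0], [2.0, 20.0], [3.0, 40.0], [9.0, 100.0]]
estimate = knn_impute_value([1.0, None], 1, pool, k=3)
```

Each call sorts the whole candidate pool, which illustrates why the cost grows quickly for large datasets.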

Multivariate Imputation by Chained Equations
The MICE algorithm designed by van Buuren and Groothuis-Oudshoorn [33] is based on a Markov chain Monte Carlo method wherein the state space is the collection of all imputed values [13]. The merit of MICE is that the results are obtained after a comparatively small number of iterations; as per some studies [33,34], five iterations are usually sufficient. The MICE algorithm [35] for the imputation of multivariate missing data consists of the following steps (Algorithm 1).

1. Execute a simple imputation, such as mean imputation, for every missing value in the dataset. These mean imputations are called "place holders."
2. Set the "place holder" mean imputations for one variable ("var") back to "missing."
3. Develop a regression model (e.g., linear, logistic, or Poisson) in which "var" is the dependent variable and all other variables are independent variables. We note that the imputation model may or may not include all of the variables in the dataset.
4. Replace the missing values for "var" with the predicted values. When "var" is subsequently used as an independent variable in the regression models for other variables, both the observed and these imputed values are used.
5. Repeat Steps 2-4 for each variable that has missing data. Cycling through each of these variables constitutes one iteration or cycle. At the end of one cycle, all of the missing values have been substituted by predictions from regressions that are related to the observed data.
6. Repeat Steps 2-4 for a number of cycles, with the imputations being updated at each cycle.
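The steps above can be sketched in simplified form as follows. This version uses ordinary least-squares regression for every variable and inserts the predicted values directly, so it is a deterministic chained-regression sketch rather than a full MICE implementation, which would normally draw imputations from a predictive distribution. The function names `solve`, `fit_predict`, and `chained_imputation` are hypothetical, and no safeguard against collinear predictors is included.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * m for a, m in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_predict(X_train, y_train, X_new):
    """Least-squares fit with intercept (normal equations), then predict."""
    Xd = [[1.0] + list(r) for r in X_train]
    p = len(Xd[0])
    XtX = [[sum(r[i] * r[j] for r in Xd) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * y for r, y in zip(Xd, y_train)) for i in range(p)]
    beta = solve(XtX, Xty)
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X_new]

def chained_imputation(rows, cycles=5):
    n_cols = len(rows[0])
    missing = [[v is None for v in r] for r in rows]
    data = [list(r) for r in rows]
    # Step 1: fill every missing cell with its column mean (place holder).
    for j in range(n_cols):
        obs = [r[j] for r in rows if r[j] is not None]
        mean_j = sum(obs) / len(obs)
        for i in range(len(data)):
            if missing[i][j]:
                data[i][j] = mean_j

    def others(i, j):
        return [data[i][c] for c in range(n_cols) if c != j]

    for _ in range(cycles):                       # Step 6: repeat the cycle
        for j in range(n_cols):                   # Step 5: visit each variable
            miss = [i for i in range(len(data)) if missing[i][j]]
            if not miss:
                continue
            # Steps 2-3: fit only on rows where variable j was observed.
            obs = [i for i in range(len(data)) if not missing[i][j]]
            preds = fit_predict([others(i, j) for i in obs],
                                [data[i][j] for i in obs],
                                [others(i, j) for i in miss])
            for i, v in zip(miss, preds):         # Step 4: insert predictions
                data[i][j] = v
    return data

# Hypothetical two-variable table in which the second column is twice the first.
table = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]
completed = chained_imputation(table)
```

The default of five cycles follows the rule of thumb cited above; observed cells are never modified.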

Impact Evaluation of Missing Data Imputation on PV Forecasting
In order to verify the effect of missing data imputation on PV forecasting, we used four error indices: the root mean square error (RMSE), mean relative error (MRE), root mean square deviation (RMSD), and mean relative deviation (MRD). The reference model refers to the model trained with the original training data and evaluated on the original test data, with no missing values.
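As a hedged illustration of how such indices can be computed, the sketch below assumes that RMSE and MRE compare a prediction with the measured generation, that RMSD and MRD compare a prediction obtained after imputation with the prediction of the reference model, and that the relative indices are normalized by the 196 kW plant capacity. The normalization and the function names are assumptions of this example, not definitions recovered from the paper.

```python
import math

CAPACITY_KW = 196.0   # peak output of the plant, as stated in the data section

def root_mean_square(pred, ref):
    """RMSE when `ref` is measured power, RMSD when `ref` is the reference model."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))

def mean_relative(pred, ref, capacity=CAPACITY_KW):
    """MRE or MRD in percent, normalised by plant capacity (an assumption)."""
    return 100.0 * sum(abs(p - r) for p, r in zip(pred, ref)) / (len(pred) * capacity)

# Hypothetical hourly values in kW.
measured  = [0.0, 50.0, 120.0, 80.0]
reference = [0.0, 55.0, 110.0, 85.0]   # prediction without missing data
imputed   = [0.0, 57.0, 108.0, 85.0]   # prediction after imputation

rmse = root_mean_square(imputed, measured)
mre  = mean_relative(imputed, measured)
rmsd = root_mean_square(imputed, reference)
mrd  = mean_relative(imputed, reference)
```

Using the same two functions for both pairs keeps the error and deviation indices directly comparable.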
Next, the values obtained with the iterations were averaged for each test period (i.e., 20-27 of January, March, and June 2017) and each missing data rate (10%, 15%, and 20%). The results corresponding to the prediction model error are listed in Table 1, while the deviation results are listed in Tables 2-4. For each missing data rate, the graphs of the PV power generation prediction with the application of the four missing data imputation methods and the reference model are shown in Figures 6-8. Here, the hourly predicted power generation values for each method are calculated as the average of the power generation values predicted through five iterations, and the reference model refers to the prediction model that is trained when no missing values are found in the training data and test data. That is, the average value for a given hour is the mean of the five predictions generated for that same hour in the five iterations. A single iteration cannot represent the predicted power generation at a given hour because the positions where the missing values occur in the data differ between iterations, even if the same missing value replacement method and prediction model are used. Therefore, we consider the averages of the power generation at the same hour and plot the power generation profiles. This procedure also applies to Figures 9-11.
Here, we carry out the same process as that described in Section 4.1.1, except that the missing data are in the test data and not the training data. The errors resulting from the estimation of the power generation with the application of the four missing data imputation methods according to the missing data rate, with a single training dataset, are listed in Tables 5-7. As mentioned in Section 4.1.1, the mean value of the error was calculated over five trials, and the average of the predicted power generation was calculated for each of the four methods; the corresponding forecasting results are shown in Figures 9-11.

Comparison of Methods and Discussion
PV prediction simulations were conducted with the missing data imputation methods applied to the training data and to the test data separately; the results for these cases have been presented in Sections 4.1.1 and 4.1.2, respectively. When the missing data imputation methods are applied to the training data, the RMSD is at most 5 kW and the MRD is less than 2%. That is, in this case, no large error arises relative to training performed with data having no missing values. The reasons underlying this result are as follows. First, the predicted power generation values in this study correspond to the power generation measured in 2017, and these are not directly affected by the training data, whose missing values belong to 2016. Here, we note that the weather and PV generation data are time-dependent data (i.e., time series data). The predicted value for PV generation in 2017 was determined by the 2017 test data alone. Accordingly, the 2016 weather data were only used to build the prediction model and did not directly impact the values generated for 2017. Second, even at the maximum missing data rate of 20%, the trained model is influenced more by the remaining 80% of the training data than by the 20% that is missing. Therefore, although missing data occur in the training data, there is no noticeably large error.
On the other hand, there is a noticeable difference between the case in which the missing data imputation method is applied to the test data and the case in which it is not. Because the test data correspond to 2017, the predicted values for 2017 are directly affected. For example, the PV power generation prediction for 20 January 2017 was determined using meteorological data, such as solar radiation, temperature, and precipitation, as well as the time of day being predicted. Since the relationship between the independent and dependent variables is established between data of the same time, the predicted PV power generation changes as the values of the weather data change. From Tables 5-7, it can be observed that the error between the reference model and the prediction model with missing data imputation increases with increasing missing data rates. When the missing data rate is low, LI affords the least error. However, for high missing data rates, the KNN method yields the smallest error. As the missing data rate increases, the number of adjoining missing values in the datasets increases. LI is suitable for imputing missing values at isolated points in time, but it is not accurate for missing values in successive sections. Thus, when the missing data rate is 20%, the error of LI rapidly increases. MI yields the largest error regardless of the missing data rate and test duration. This method is suitable for discrete data; however, weather data are continuous. In conclusion, these simulations indicate that the KNN method is the most suitable missing data imputation method for day-ahead PV power forecasting.

Conclusions
The prediction of PV power generation is important as the number of new and renewable power generation systems joining the grid increases. In this regard, many machine learning-based methods for power generation prediction have been studied and utilized. To ensure increased accuracy, it is necessary to compensate for missing meteorological data values


Figure 2. Time step of data and time horizon of the forecasting model.

Figure 4. Process of the simulation for PV forecasting model.


Figure 3. Support vector regression (SVR)-based photovoltaic (PV) power prediction model utilizing missing data imputation. LI, linear interpolation; MI, mode imputation; MICE, multivariate imputation by chained equations.

Further, Figure 4 depicts the simulation process of the proposed model.


Figure 8. Forecasting results from 20-27 June for various missing data rates (10%, 15%, and 20%).

4.1.2. Evaluation of Forecasting Model Performance with Missing Data Imputation for the Test Data Case
Here, we carry out the same process as that described in Section 4.1.1, except that the missing data are in the test data and not the training data. The errors resulting from the estimation of the power generation with the application of the four missing data imputation methods according to the missing data rate with one training dataset are listed in Tables 5-7. As mentioned in Section 4.1.1, the mean value of the error was calculated over five trials, and the average of the predicted power generation was calculated by use of the four methods; the corresponding forecasting results are shown in

Table 1. Error of PV forecasting reference model.

Table 2. Deviation of missing data imputation from the PV forecasting model in January using the test set. MRD, mean relative deviation;

Table 3. Deviation of missing data imputation from the PV forecasting model in March using the test set.

Table 4. Deviation of missing data imputation from the PV forecasting model in June using the test set.

Table 5. Deviation of missing data imputation from the PV forecasting model with test data in January.

Table 6. Deviation of missing data imputation from the PV forecasting model with test data in March.

Table 7. Deviation of missing data imputation from the PV forecasting model with test data in June.

Figure panels: (a) Forecasting results for missing data rate 10% in test dataset; (b) Forecasting results for missing data rate 15% in test dataset; (c) Forecasting results for missing data rate 20% in test dataset.