A Novel Hybrid Spatio-Temporal Forecasting of Multisite Solar Photovoltaic Generation

Currently, the world is actively responding to climate change problems. There is significant research interest in renewable energy generation, with focused attention on solar photovoltaic (PV) generation. Therefore, this study developed an accurate and precise solar PV generation prediction model for several solar PV power plants in various regions of South Korea to establish stable supply-and-demand power grid systems. To reflect the spatial and temporal characteristics of solar PV generation, data extracted from satellite images and numerical text data were combined and used. Experiments were conducted on solar PV power plants in Incheon, Busan, and Yeongam, and various machine learning algorithms were applied, including the SARIMAX, which is a traditional statistical time-series analysis method. Furthermore, for developing a precise solar PV generation prediction model, the SARIMAX-LSTM model was applied using a stacking ensemble technique that created one prediction model by combining the advantages of several prediction models. Consequently, an advanced multisite hybrid spatio-temporal solar PV generation prediction model with superior performance was proposed using information that could not be learned in the existing single-site solar PV generation prediction model.


Introduction
The issue of rapid climate change caused by industrialization, fossil fuel depletion, and carbon emissions is emerging worldwide [1]. Therefore, the Kyoto Protocol (1997) and the Paris Agreement (2016) were concluded to promote decarbonization in countries globally [2,3]. South Korea is one of the top 10 countries with the highest per capita carbon emissions. In response, the South Korean government announced the Renewable Energy 3020 Plan (2017) to achieve 20% renewable energy generation by 2030 and supply more than 95% of new facilities with clean energy, such as solar PV and wind power [4]. For solar PV generation, the most popular clean energy source, large-scale solar PV farms have been constructed worldwide because of the decline in the cost of solar panels and power generation system facilities over the past decade [5]. The United States, Germany, and China have representative gigawatt-scale solar PV farms. In South Korea, solar PV capacity expanded from 467 MW in 2013 to 5.7 GW in 2017, constituting 38% of the country's total renewable energy capacity [6].
Solar PV generation is a technology that generates electricity through the photoelectric effect when light energy from the sun passes through the atmosphere and is absorbed by the solar panel. It has the advantage of being a clean and inexhaustible resource [7]. Compared to other renewable energy generation fields, installation and maintenance costs are low, and the life expectancy is more than 20 years. Furthermore, installing the power plant causes minimal damage to the surrounding natural environment. However, solar PV generation requires a large installation area because of its low energy density, and the amount of solar PV generation reacts sensitively to fluctuations in external meteorological factors such as clouds moved by wind, naturally occurring yellow dust, or particulate matter (PM) generated in city centers. These changes in meteorological factors are fluid and complex, which makes solar PV generation difficult to predict and undermines the system stability of the Smart Grid, a technology combining information and communication technology with the power grid [8]. Consequently, accurate forecasting technology that contributes to stabilizing power supply and demand is critical. If an accurate supply and demand plan is not established, huge financial and social losses can be incurred, such as blackouts and the consumption of more resources than necessary. Therefore, accurate forecasting of power generation for renewable energy sources is critical in establishing an efficient power supply and demand plan.
Recently, air pollution caused by PM has emerged as a social issue in South Korea [9]. As the PM concentration in the atmosphere increases, PM absorbs or scatters solar radiation before it passes through the atmosphere and reaches the surface, reducing the amount of irradiance reaching the solar panel. Most previous studies have analyzed the effects of red soil in the dry regions of the Middle East or have been conducted in Southeast Asia, where the natural and anthropogenic emissions of PM are higher than in other regions [10][11][12]. Furthermore, these studies analyzed the accumulation of various types of dust on the solar panel rather than the influence of PM concentrations distributed in the atmosphere. Therefore, this study analyzes and reflects the effects of the concentrations of air pollutants, including PM10 and PM2.5, on solar PV generation.
Solar PV generation prediction can be classified into the direct prediction method of solar PV generation using various independent parameters and the indirect prediction method of solar PV generation using predicted irradiance as independent parameters. The prediction parameters can also be classified into two methods. The first method uses text data numerically composed of parameters, such as temperature, humidity, and precipitation, provided by the Meteorological Agency [13][14][15][16][17]. The numerical text data of various time units comprise hourly data, and the amount of solar PV generation is predicted using the time-series characteristics contained in the data organized with time. However, this method does not reflect the spatial characteristics of parameters such as clouds and PM displaced by the wind. The second method uses motion vectors or indices of clouds and aerosols in satellite images [18][19][20][21][22]. The shading from the clouds and scattering of light from yellow dust or PM cause significant fluctuations in the amount of insolation, which has the most direct influence on solar PV generation prediction. The increase or decrease in irradiance can be reflected by tracking the motion vector of cloud and aerosol movement appearing in the satellite image. However, as satellite images occupy a large area, it is challenging to obtain detailed information about a specific area to predict solar PV generation.
Clouds and PM values change with time at the observation point. However, when the observation area is expanded, clouds and PM also show spatial characteristics because they are moved by the wind. Therefore, to predict the amount of solar PV generation, a hybrid spatio-temporal model was developed by combining numerical text data and information extracted from satellite images [23], unlike the methods in previous studies that used numerical text data or satellite images individually [13][14][15][16][17][18][19][20][21][22]. It combines the time-series characteristics from numerical text data and the spatial characteristics from satellite images to predict solar PV generation. However, the hybrid spatio-temporal prediction model in the previous study predicted the generation of solar PV power plants in a single region [23]. The amount of solar PV generation at a single site fluctuates sensitively with climate change; however, if the solar PV generation in multiple distant regions is aggregated, the smoothing effect dampens extreme fluctuations, which supports an efficient power supply and demand plan. Therefore, in this study, to solve the climate change sensitivity problem of single-site solar PV generation and surpass the performance of a single-site prediction model, multiple regions were analyzed and an advanced integrated solar PV generation prediction model was developed for South Korea. The single-site solar PV generation prediction model predicted the solar PV generation of only one solar PV power plant, located in Incheon; therefore, to predict multisite solar PV generation, solar PV power plants in two further regions, Busan and Yeongam, were added to the study.
By developing an advanced multisite integrated solar PV generation prediction model in South Korea, the amount of solar PV generation for future new solar PV power plants can also be predicted by simply filling out facility and geographical information for each solar PV power plant. Therefore, this study proposed an advanced multisite integrated hybrid spatio-temporal solar PV generation prediction model in South Korea. It combined spatial information data extracted from satellite images, reflecting the analysis of wider spatial characteristics with numerical weather data mainly used in conventional solar PV generation prediction studies.
Various machine learning algorithms and prediction techniques were used to predict the amount of solar PV generation [24][25][26][27][28][29]. An hourly advanced multisite integrated hybrid spatio-temporal solar PV generation prediction model was developed that is more accurate and precise than a single-site solar PV generation prediction model. Various prediction models using machine learning algorithms such as the SARIMAX, SVR, DNN, LSTM, Random Forest, and SARIMAX-LSTM models were used.

Research Framework
This study develops an hourly advanced multisite integrated hybrid spatio-temporal solar PV generation prediction model in South Korea. The prediction model uses meteorological numerical text data provided by the Korea Meteorological Agency (KMA) and spatial information data extracted from satellite images to reflect both temporal and spatial characteristics. By reflecting the spatio-temporal characteristics, higher prediction accuracy can be achieved than with models that use only numerical text data or only satellite images. Figure 1 shows the overall flow of this study. The first step was to select solar PV power plants in three cities in South Korea, namely, Incheon, Busan, and Yeongam. A database (DB) was built by collecting and preprocessing meteorological information provided by the KMA in each region and satellite images provided by the National Meteorological Satellite Center (NMSC). The second step was to extract the necessary spatial information from four satellite images: the wind direction vector and wind speed from the atmospheric motion vector (AMV) image, the amount and thickness of clouds from the cloud optical thickness (COT) image, the amount and concentration of PM from the aerosol optical depth (AOD) image, and the amount of irradiance from the insolation (INS) image. The third step was to set the center coordinates for each region and the region of interest (ROI) around them. Furthermore, an ROI_adj of the same size as the ROI was set in each of the eight directions adjacent to the ROI. So that the solar PV generation prediction models could learn spatial information, the movement of clouds and PM according to the wind direction was analyzed in the ROI_adj and ROI. The fourth step was to combine the meteorological numerical text data DB built in the first step with the DB of data extracted from satellite images and to perform a correlation analysis between each meteorological parameter, including clouds and PM, and the amount of solar PV generation.
Finally, the fifth step was to develop the hourly advanced multisite integrated hybrid spatio-temporal solar PV generation prediction model by applying SARIMAX (a traditional time-series analysis method), SVR, DNN, LSTM, Random Forest, and the SARIMAX-LSTM model, which incorporates the advantages of each method. Parameter optimization was then performed for each technique to increase the prediction performance.

Satellite Image Data
Herein, the solar PV generation prediction model should learn the spatial characteristics of each meteorological factor. Therefore, to extract spatial information, four years of satellite images from 2015 to 2018, taken by the Communication, Ocean, and Meteorological Satellite (COMS), were provided by the NMSC [30]. The COMS is South Korea's first geostationary multipurpose satellite that provides meteorological and ocean observations and communication services. It was launched on 27 June 2010, from the Guiana Space Center. The COMS takes images of the Korean Peninsula with a size of 1024 × 1024 pixels and a spatial resolution of 1720.8 m per pixel. Every 15 min, 16 types of images are taken, including cloud detection, AMV, and surface temperature. In this study, four of the 16 types of images (AMV, COT, AOD, and INS) were used [31][32][33][34]. Figure 2 shows a sample of each image at 13:00 on 9 February 2018. The description of each image and the methods for spatial information extraction are given in the subsections.

Atmospheric Motion Vector Image and Region of Interest
Clouds and PM significantly influence irradiance, a critical element of solar PV generation, and both move with the wind. AMV images were used to capture the spatial movement of clouds and PM. In Figure 2a, the AMV image shows the wind direction and wind speed information with arrows. The wind direction arrows are colored red, green, or blue according to altitude. However, the AMV image does not provide numerical information on the wind direction vector. Therefore, the wind direction and wind speed were extracted numerically in the following sequence. First, we selected the wind direction arrow closest to the target region and located the center coordinates of the arrow. The angle between the center coordinates and the body of the arrow, indicated by θ in Figure 3, was calculated to obtain the wind direction. Second, the wind speed was calculated from the shape of the wing attached to the body of the arrow.
By setting the target region, where the solar PV power plant for predicting solar PV generation is located, as an ROI, the spatial characteristics of clouds and PM moving according to the wind direction were analyzed. The wind direction arrows in the AMV image rotate 360° around the center coordinates. Therefore, as the center coordinates of the wind direction arrow were fixed, the ROI was set to 50 × 50 pixels, a size that does not interfere with the wind direction arrow as it rotates with time. Furthermore, the impact on the surrounding region was identified by setting an ROI_adj in each of the eight adjacent directions around the ROI. Figure 4 shows the ROI and ROI_adj set in Incheon, Busan, and Yeongam in magenta and cyan, respectively.
The COT image represents the thickness of the clouds through the color index in the bottom right corner, from which information about the amount and thickness of clouds was extracted.
The color index values from 0 to 100 were divided into quarters and classified as clear, partly cloudy, mostly cloudy, and cloudy. Subsequently, the number of pixels of each index color within the ROI and ROI_adj set through the AMV image was counted, and the information about the cloud amount and thickness was saved. Similar to the COT image, the AOD image represents air pollutants, such as yellow dust and PM, as a color index. The color index is divided into good, moderate, bad, and very bad, and the PM amount and concentrations in the ROI and ROI_adj were saved. Finally, the INS image represents the amount of irradiance reaching the surface using the color index. To extract information about the amount of irradiance reaching the surface, the index values of the pixels in the ROI were averaged. Table 1 shows the information extracted from three satellite images of the ROI in Busan.
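The quartile classification and pixel counting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the image colors have already been decoded into a hypothetical `index_grid` of per-pixel index values, and the exact bin edges (25, 50, 75) and the function names are assumptions.

```python
from collections import Counter

# Labels for the four quarters of the 0-100 COT color index (taken from the text).
COT_LABELS = ("clear", "partly cloudy", "mostly cloudy", "cloudy")


def classify_index(value, labels=COT_LABELS):
    """Map a color-index value in [0, 100] to one of four quarter labels."""
    if not 0 <= value <= 100:
        raise ValueError("color index must lie in [0, 100]")
    # Assumed bin edges: [0, 25), [25, 50), [50, 75), [75, 100].
    return labels[min(int(value // 25), 3)]


def count_levels(index_grid, top, left, size=50):
    """Count pixels per level inside a size x size window of an index grid.

    index_grid is a hypothetical 2D list of decoded index values; (top, left)
    is the upper-left corner of the ROI or of one ROI_adj.
    """
    counts = Counter()
    for row in index_grid[top:top + size]:
        for value in row[left:left + size]:
            counts[classify_index(value)] += 1
    return counts
```

The same counting can be reused for the AOD image by passing the labels good, moderate, bad, and very bad to `classify_index`.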

Numerical Text Data
To predict the amount of hourly solar PV generation, three categories of numerical text data were used. Meteorological factors, such as temperature, humidity, and precipitation; air pollutants, such as PM10 and PM2.5; and solar PV generation data were used as parameters for predicting solar PV generation. The KMA, Air Korea, and the Open Data Portal provided the data [35][36][37], respectively. The KMA began meteorological observations in 1904 and operates meteorological stations in 103 regions across the country. Through these stations, more than 15 types of hourly data, such as temperature, precipitation, and humidity, are provided as public data. The meteorological stations used in the experiment are located at 37.4777658 lat. and 126.6223456 long. in Incheon and at 35.2061563 lat. and 129.0806029 long. in Busan. Yeongam does not have a meteorological station, so the closest station, in Mokpo, was used. The meteorological station in Mokpo is located at 34.8171105 lat. and 126.3789376 long. Herein, temperature, humidity, cloudiness, wind speed, wind direction, precipitation, amount of sunlight, irradiance, and visibility were used as meteorological factors for predicting solar PV generation. Air pollution caused by fossil fuels and vehicle exhaust causes serious environmental problems. An increase in the PM concentration in the atmosphere not only harms the human body but also decreases the amount of irradiance by reducing visibility, because sunlight is scattered and absorbed as it passes through the atmosphere. This significantly reduces solar PV generation. Therefore, the data for SO2, CO, O3, NO2, PM10, and PM2.5 provided by Air Korea were used as air pollution factors for predicting solar PV generation.
Finally, the Open Data Portal provided the hourly solar PV generation data, which are the most critical data. Furthermore, latitude, longitude, and altitude were added as geographic information for each solar PV power plant, and the facility capacity and installation angle of the solar panels were added as facility information. All data were collected for four years from 0:00 on 1 January 2015 to 23:00 on 31 December 2018. The k-nearest neighbors algorithm was used to interpolate missing values in the collected data, using the data from the 36 h before and the 36 h after the missing time point, i.e., 72 h in total. The amount of irradiance during daylight hours determines the amount of solar PV generation; hence, of the 24 h in a day, the daylight period was set from 09:00 to 17:00. Table 2 summarizes the capacity of each solar PV power plant used in the study and the distances between the stations. Table 3 shows a sample of numerical text data from Incheon.
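As a rough sketch of the windowed interpolation described above, the following fills each missing hourly value with the mean of the k observed values nearest in time within 36 h on either side. The choice of time distance as the neighbor metric, the value k = 5, and the function name `knn_fill` are assumptions for illustration; the text does not state them.

```python
def knn_fill(series, k=5, window=36):
    """Fill None entries with the mean of the k temporally nearest observed
    values found within `window` hours before or after the gap."""
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        lo, hi = max(0, i - window), min(len(series), i + window + 1)
        # Candidate neighbors: observed points in the window as (distance, value).
        neighbors = sorted(
            (abs(j - i), series[j]) for j in range(lo, hi) if series[j] is not None
        )[:k]
        if neighbors:  # leave the gap untouched if nothing is observed nearby
            filled[i] = sum(v for _, v in neighbors) / len(neighbors)
    return filled
```

Neighbors are always taken from the original series, so a filled value never feeds into the estimate for another gap.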

Parameter Analysis
Pearson correlation analysis was conducted to analyze the correlation of the parameters used to predict solar PV generation. Furthermore, additional validation was performed to analyze the effect on solar PV generation of clouds and PM, using both the numerical text data provided by the KMA and the spatial information data extracted from satellite images. For clouds, the numerical text data comprise levels 0-10, and the data extracted from the satellite image consist of four levels. For PM (Table 4), the numerical text data comprise four levels for both PM10 and PM2.5 according to the standards used in South Korea. The satellite image data were also analyzed by dividing them into four levels. To exclude the impact of the other parameter as much as possible, the effect of clouds was analyzed only when PM10 and PM2.5 were both at a good level, whereas the effect of PM was analyzed only when the cloud level was 0-1. Furthermore, the analysis was conducted for the 2 h from 12:00 to 14:00, around noon, when the highest amount of solar PV generation takes place. Figures 5 and 6 show the correlation analysis results of the amount of solar PV generation for clouds and PM in each region. As the amount of clouds or the PM concentration increases, the amount of solar PV generation decreases.
As such, the spatial characteristics of each parameter are critical when learning the characteristics of clouds and PM, which significantly affect solar PV generation prediction. Therefore, spatial characteristics were verified using cloud and PM data extracted from satellite images and wind direction data extracted from AMV images. The verification method is as follows. First, at time t, the wind direction of the ROI is recognized. Next, the cloud and PM amounts of the ROI and each ROI_adj are analyzed at time t. Finally, depending on the wind direction, the increase or decrease caused by the movement of clouds and PM is determined at time t + 1 in the ROI.
For example, assume that the wind direction is north, and the amounts of clouds in the ROI and ROI_adj at time t are 5 and 8, respectively. If the amount of cloud in the ROI is greater than 5 at time t + 1, the result is judged true; otherwise, it is judged false. Tables 5 and 6 show the verified results.
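The verification rule can be sketched as below. This is an illustrative reading, not the authors' code: it assumes that the ROI_adj being compared is the one lying upwind of the ROI, that image rows increase southward, and that a lower upwind amount implies an expected decrease. The names `UPWIND_OFFSET` and `verify_step` are hypothetical.

```python
# Offset (row, col) from the ROI to the adjacent ROI that the wind blows FROM,
# with rows increasing southward in image coordinates (assumed convention).
UPWIND_OFFSET = {
    "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
    "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
}


def verify_step(wind_from, roi_t, adj_t, roi_t1):
    """Return True/False for one time step, or None if no change is expected.

    wind_from: compass label of the direction the wind blows from.
    roi_t:     cloud (or PM) amount in the ROI at time t.
    adj_t:     dict mapping (row, col) offset -> amount in that ROI_adj at time t.
    roi_t1:    amount in the ROI at time t + 1.
    """
    upwind = adj_t[UPWIND_OFFSET[wind_from]]
    if upwind > roi_t:   # more cloud upwind: an increase is expected
        return roi_t1 > roi_t
    if upwind < roi_t:   # less cloud upwind: a decrease is expected (assumed)
        return roi_t1 < roi_t
    return None
```

With the numbers from the example (north wind, 5 in the ROI, 8 in the ROI_adj), `verify_step("N", 5, {(-1, 0): 8}, 6)` evaluates the case in which the ROI amount rises above 5 at time t + 1.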

Prediction Methods for Solar PV Generation
Various methods were used to predict the amount of solar PV generation. We used SARIMAX, a traditional statistical time-series analysis method, and SVR, a method that applies a loss function to the support vector machine (SVM), a representative classification algorithm. The DNN, which achieves high-level prediction performance by combining several nonlinear transformation techniques, was also used. As a method based on decision trees, a random forest model was used. LSTM, which is well suited to classification, processing, and prediction based on time-series data, was used as well. Finally, the SARIMAX-LSTM model was used to create a new model that combines the merits of the SARIMAX and LSTM models. Detailed descriptions of each method and model are provided in the following subsections.

Seasonal Autoregressive Integrated Moving Average with Exogenous Factors
The autoregressive integrated moving average (ARIMA) is a traditional statistical time-series analysis method, used by Newsham and Birt [38], that serves as a regression model including both the autoregressive (AR) model and the moving average (MA) model. The AR model determines whether past data affect future data, and the MA model identifies a trend in which the average value of a random variable continuously increases or decreases with time. Whereas the ARIMA is a univariate time-series model, the ARIMAX can handle multivariate time-series data by adding external (exogenous) factors. To apply the ARIMAX model, the data must be in a steady state, i.e., stationary. If the data are not stationary, they should be differenced to obtain a steady state and then applied to the regression model.
The SARIMAX model adds seasonal characteristics to the ARIMAX model and can reflect the periodicity of the data [39]. The amount of solar PV generation, together with the meteorological parameters used in the study, satisfies the steady-state and seasonal periodicity conditions, as it reflects the characteristics of the four seasons and uses hourly data. The SARIMAX model is specified by the orders of the nonseasonal AR (p), nonseasonal difference (d), nonseasonal MA (q), seasonal AR (P), seasonal difference (D), and seasonal MA (Q) terms, together with the seasonal period (s). In this study, SARIMAX(3, 0, 3)(3, 0, 3, 12), i.e., (p, d, q)(P, D, Q, s) with s = 12, was used as the order for the solar PV generation prediction model.
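For reference, one common way to write a SARIMAX(p, d, q)(P, D, Q, s) model is as a regression on exogenous inputs with seasonal ARIMA errors. The form below is a sketch in that convention; the symbols B (backshift operator), x_t (exogenous inputs), β (their coefficients), and ε_t (white noise) are notation introduced here, and other sign conventions for the MA polynomials exist.

```latex
\phi_p(B)\,\Phi_P(B^{s})\,(1-B)^{d}\,(1-B^{s})^{D}\,\bigl(y_t - \beta^{\top} x_t\bigr)
  = \theta_q(B)\,\Theta_Q(B^{s})\,\varepsilon_t ,
\qquad
\begin{aligned}
\phi_p(B)       &= 1 - \phi_1 B - \dots - \phi_p B^{p}, &
\theta_q(B)     &= 1 + \theta_1 B + \dots + \theta_q B^{q},\\
\Phi_P(B^{s})   &= 1 - \Phi_1 B^{s} - \dots - \Phi_P B^{Ps}, &
\Theta_Q(B^{s}) &= 1 + \Theta_1 B^{s} + \dots + \Theta_Q B^{Qs}.
\end{aligned}
```

With the orders used in this study (d = D = 0, s = 12), both differencing factors equal 1 and the model reduces to \phi_3(B)\,\Phi_3(B^{12})\,(y_t - \beta^{\top} x_t) = \theta_3(B)\,\Theta_3(B^{12})\,\varepsilon_t.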

Support Vector Regression
The SVM is a representative classification algorithm proposed by Vapnik in 1995 [40]. The SVR method introduces a loss function to the SVM for regression analysis. The SVR must obtain an optimal regression function f(x) that minimizes the difference between the actual and predicted values. To this end, the loss function reduces the size of the regression coefficients to find a line that flattens the regression equation while keeping the predicted values within a specific deviation ε of the actual values; the samples lying on or outside this margin are the support vectors. The smaller the deviations beyond this margin, the closer the regression function f(x) is to optimal. This formulation is linear, but most data cannot be fitted using linear regression alone; a nonlinear regression equation is required. The SVR can solve this problem by using a mapping function that maps the data of the existing input space into a feature space, enabling the data to be expressed linearly in a high-dimensional space. When data are mapped to a higher dimension, the regression equation becomes complex because of the curse of dimensionality, which significantly increases the amount of computation. This problem can be solved using kernel functions, such as the radial basis function, linear, and polynomial kernels. The optimal regression function f(x) can be calculated by solving the Lagrangian problem through the dot product of the vectors calculated using the kernel function. Herein, after experimenting with various kernels of SVR models for solar PV generation prediction, a linear kernel, which showed the best prediction performance, was used.
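The role of the deviation ε can be illustrated with the ε-insensitive loss that is commonly associated with SVR: errors that fall inside the ε tube cost nothing, and only the excess beyond ε is penalized. The following is a minimal sketch of that loss alone, not of the full SVR optimization; the function name and the default ε are assumptions.

```python
def epsilon_insensitive_loss(actual, predicted, epsilon=0.1):
    """Mean epsilon-insensitive loss over paired actual/predicted values.

    An absolute error of at most `epsilon` contributes zero; a larger error
    contributes only the part that exceeds `epsilon`.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    losses = [max(0.0, abs(a - p) - epsilon) for a, p in zip(actual, predicted)]
    return sum(losses) / len(losses)
```

Samples whose individual loss is positive lie outside the tube, which is why they end up shaping the fitted regression function.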

Deep Neural Network
Machine learning is used for classification and prediction in various fields [41]. The DNN consists of an input layer, hidden layers, and an output layer, and more complex computation is possible by expanding the number of hidden layers in artificial neural networks (ANN) that mimic the human brain structure. The nodes at each DNN layer are interconnected; hence, they have the same effect as the many connected neurons that collect and process multiple data in the human brain. By interacting with various nonlinear activation functions, such as Sigmoid, ReLU, and tanh, in each DNN layer, the DNN model itself creates labels for the training data or distorts the space to derive optimal classification or prediction results. In the conventional feed-forward ANN, computation proceeds in one direction from the input layer through the hidden layer to the output layer, which by itself provides no way to adjust the weights. The precision of the prediction result can be improved by adopting the backpropagation algorithm, which propagates the gradient of the error backward from the output layer so that the weights can be updated using the gradient descent algorithm. If the number of hidden layers is simply increased to design the DNN model, the optimization might become stuck in local minima, or the vanishing gradient problem can occur, resulting in lower performance than a shallow ANN. Therefore, by using dropout layers or applying a suitable nonlinear activation function, the vanishing gradient and overfitting problems can be mitigated and higher prediction performance can be obtained. Table 7 shows the structure of the DNN model used to predict solar PV generation in this study.

Long Short-Term Memory
The recurrent neural network (RNN) allows for effective analysis of data with time-series characteristics because it can consider sequence or temporal characteristics, through which past data can affect the future outcome [42]. Unlike in other neural networks, the output of the hidden layer is fed back as input to the same hidden layer, and the weights are shared. However, because of the gradient-vanishing phenomenon, in which gradient values become exponentially smaller during the backpropagation process, and gradient explosion, in which gradient values grow exponentially during the learning process, the RNN cannot accurately reflect long-term dependencies, and the model cannot proceed with learning.
Hochreiter and Schmidhuber proposed the LSTM, which can solve the long-term dependence problem of the RNN [43]. The LSTM has four interacting layers, and through the cell state, key information continues to be conveyed to the next step. Furthermore, the four layers use gate elements to add or remove information. The gates that protect and control the cell state comprise a forget gate, an input gate, and tanh layers, allowing information to flow selectively. Each gate consists of a Sigmoid neural net layer and a point-by-point multiplication operation. The Sigmoid layer outputs a value between 0 and 1 to determine the effect of each component. If the output value is 0, the corresponding component does not affect the future. Conversely, when the output value is 1, the corresponding component fully influences the future prediction result. Table 8 shows the structure of the LSTM model used to predict solar PV generation in this study.
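To make the gating mechanism concrete, the sketch below implements one time step of a scalar LSTM cell in its standard textbook form. It is not the model of Table 8: the scalar state, the weight layout `w`, and the output gate (part of the standard formulation, though not named in the text) are assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function; its output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps each of "f", "i", "g", "o" to a (w_x, w_h, bias) triple.
    Returns the new hidden state, the new cell state, and the gate values.
    """
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: share of the old cell state to keep
    i = sigmoid(pre("i"))    # input gate: share of the candidate to write
    g = math.tanh(pre("g"))  # candidate cell content from the tanh layer
    o = sigmoid(pre("o"))    # output gate: share of the cell state to expose
    c = f * c_prev + i * g   # point-by-point (here scalar) multiplication
    h = o * math.tanh(c)
    return h, c, (f, i, o)
```

A forget-gate value near 0 erases the previous cell state, whereas a value near 1 carries it forward almost unchanged, which is how information can persist over long sequences.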

Random Forest
Random forest is an ensemble algorithm that learns multiple decision trees [44]. It is widely used in classification and regression problems because it can easily manage interactions and nonlinearities between parameters and is insensitive to outliers. The work of Yali Amit and Donald Geman [45] influenced the early concept of random forest, and Leo Breiman [46] established the present concept. Random forest can effectively prevent overfitting by adding randomness in variable selection to the bagging method, which builds models from samples repeatedly drawn at random with replacement. It has high prediction stability because the prediction results of numerous decision trees are averaged for regression or combined through a majority vote for classification. Although prediction using a single decision tree has the disadvantage that the prediction result or model performance fluctuates significantly, the randomization technique, which is a characteristic of the random forest, overcomes this disadvantage and yields good generalization performance. The conventional random forest may suffer from concept drift, which deteriorates the performance of the predictive model over time. Hence, Zhukov et al. attempted to solve this problem [44]. In this study, 500 decision trees were used in the Random Forest model for solar PV generation prediction.

Ensemble Learning (SARIMAX-LSTM)
The key to ensemble learning is to achieve better generalization performance than individual weak learners by combining multiple single models to create one strong learner [47,48]. Representative ensemble techniques are classified into three methods. First, the bagging technique, which uses the voting method, randomly samples the target data with replacement. Using the sampled data as a sample group, each model is trained and the prediction results are aggregated as an average value, reducing the overfitting and underfitting errors caused by high variance or high bias. Second, the boosting technique, which uses the weighted voting method, applies weights in the sampling process, unlike the bagging technique. Whereas the bagging technique trains models in parallel, the boosting technique progresses sequentially; hence, weights are redistributed according to the sequentially derived results, which yields high accuracy. However, it has the disadvantage of being vulnerable to extreme outliers. Lastly, the stacking technique derives a new model by combining the advantages of different individual models. It adopts the characteristics of each model to highlight their advantages and complement their disadvantages, which can improve performance over a single model.
In this study, the stacking ensemble was used among the various ensemble methods, and the SARIMAX and LSTM models were combined sequentially as weak learners. This emphasizes the time-series characteristics of various parameters, including meteorological factors, and addresses the long- and short-term dependency problem. Figure 7 shows the structure of the proposed SARIMAX-LSTM model. The original data are first fed into the SARIMAX model to derive a first result, which is then used as the training data of the LSTM model to derive the final predicted value.

Error Analysis for Prediction
Various methods exist to verify the error of the prediction model and can be classified into two methods: a relative error verification method and an absolute error verification method. Representative relative error verification methods are the mean square error (MAE) and the root mean square error (RMSE). The mean absolute percentage error is mainly used as an absolute error verification method. However, when the measured value is 0, it becomes infinite or undefined, and as the measured value converges to 0, it diverges to the limit. It also has the disadvantage of distorted results when there are many extreme outliers. In this study, the symmetric mean percentage error (SMAPE) was used to overcome these shortcomings. Each error verification method is expressed as Equations (1)-(3), and a value closer to 0 indicates that the model has superior performance. Using the criteria of the American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Guideline 14 applied by energy managers to improve energy efficiency, we will additionally verify the performance of the solar PV generation prediction model [49]. For the objective evaluation of the solar PV generation prediction model, the mean bias error (MBE) and the coefficient of variation (Cv) criteria in the ASHRAE Guideline 14 were applied and are expressed as equations 4 and 5. For MBE, the performance increases as it converges to 0, regardless of the ± sign. However, in this study, absolute values have been taken for the results, thereby increasing intuition and convenience of comparison. From Table 9, according to the criteria of ASHRAE Guideline 14, the hourly prediction is defined within MBE ± 10% and Cv 30%.
F: forecast value, A: actual value, n: number of samples.
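The equations themselves are not reproduced in this text. As a reference, the commonly used forms of the five metrics are written out below with the symbols F, A, and n defined above. These forms are an assumption: several SMAPE variants exist, and ASHRAE Guideline 14 formulations often replace n with n - p, where p is the number of model parameters, so the exact expressions used in the study may differ.

```latex
\begin{align}
\mathrm{MAE}   &= \frac{1}{n}\sum_{t=1}^{n}\left|A_t - F_t\right| \\
\mathrm{RMSE}  &= \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(A_t - F_t\right)^{2}} \\
\mathrm{SMAPE} &= \frac{100}{n}\sum_{t=1}^{n}
                  \frac{\left|F_t - A_t\right|}{\left(\left|A_t\right| + \left|F_t\right|\right)/2} \\
\mathrm{MBE}   &= \frac{\sum_{t=1}^{n}\left(A_t - F_t\right)}{\sum_{t=1}^{n}A_t}\times 100 \\
C_v            &= \frac{1}{\bar{A}}\sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(A_t - F_t\right)^{2}}\times 100,
                  \qquad \bar{A} = \frac{1}{n}\sum_{t=1}^{n}A_t
\end{align}
```

In this form, SMAPE stays bounded when a single measured value is 0, which is the property that motivates its use in place of MAPE.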

Cloud and PM Prediction for Solar PV Generation
Before predicting solar PV generation, clouds and PM are predicted first to reflect their spatial characteristics. Of the entire experimental period, 2015-2018, satellite image data from 2015 to 2017 were used to learn the clouds and PM in the ROI and ROI adj. The model then predicts the hourly clouds and PM of the ROI in 2018. To predict clouds and PM, data extracted from satellite images were combined with numerical text data for meteorological and air pollutant factors. The LSTM model used for clouds and PM differs from the LSTM model used for solar PV generation prediction. Tables 1 and 3 show the input parameters. Of the parameters in Table 3, 15 are used, excluding the facility and geographical factors of the solar PV power plants. Table 10 shows the structure of the LSTM model used to predict clouds and PM in this study. Table 11 shows the prediction results.

Proposed Model for Solar PV Generation
To predict hourly solar PV generation, the prediction model is trained using various meteorological parameters, including the predicted cloud amount and PM. Furthermore, to reflect the temporal characteristics in the prediction model, variables representing time, such as the month, day, and time, were added. To predict the amount of solar PV generation, the 2018 data were divided into training, validation, and test data at a ratio of 3:1:1 for each month. Six models were used for prediction: SARIMAX, SVR (linear kernel), DNN, LSTM, Random Forest, and SARIMAX-LSTM. Table 12 shows the parameters for forecasting the amount of solar PV generation.
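The monthly 3:1:1 division can be sketched as follows. The text does not state whether the rows inside each month were split chronologically or at random; the sketch assumes a chronological split, which avoids training on hours that follow the test hours of the same month. The function name and the two-month hourly index are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def split_by_month(
    timestamps: List[datetime], ratio: Tuple[int, int, int] = (3, 1, 1)
) -> Dict[str, List[int]]:
    """Return row indices for training/validation/test, split inside each month."""
    by_month = defaultdict(list)
    for idx, ts in enumerate(timestamps):
        by_month[(ts.year, ts.month)].append(idx)

    total = sum(ratio)
    splits: Dict[str, List[int]] = {"train": [], "validation": [], "test": []}
    for key in sorted(by_month):
        # Chronological order within the month (an assumption, see the text above).
        rows = sorted(by_month[key], key=lambda i: timestamps[i])
        n_train = len(rows) * ratio[0] // total
        n_val = len(rows) * ratio[1] // total
        splits["train"] += rows[:n_train]
        splits["validation"] += rows[n_train:n_train + n_val]
        splits["test"] += rows[n_train + n_val:]  # remainder goes to the test set
    return splits


# Hypothetical hourly index covering January and February 2018.
start = datetime(2018, 1, 1)
hours = [start + timedelta(hours=h) for h in range((31 + 28) * 24)]
parts = split_by_month(hours)
print({name: len(rows) for name, rows in parts.items()})
```

Splitting inside each month, rather than over the whole year at once, keeps every season represented in all three subsets.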

Experimental Results
To compare the performance of the single-site and multisite solar PV generation prediction models, 21 of the 36 parameters were validated, namely those used by the single-site solar PV generation prediction model of a previous study [23], which excludes the facility and geographic parameters. Table 13 shows the evaluation results obtained by applying the data of the three regions to the single-site solar PV generation prediction model of the previous study. Ranked by SMAPE, the absolute evaluation method, prediction performance was best for the DNN model, followed by the ARIMAX, SVR_Linear, SVR_RBF, and ANN models. Among the five models, the ARIMAX model, which handles multivariate time-series data, was the best in all error verification methods except SMAPE and MBE. Because the ARIMAX model captures time-series characteristics, it achieves a certain level of predictive performance, but not optimal performance. The ARIMAX, DNN, and SVR_Linear models show satisfactory performance, whereas the ANN model shows severe performance degradation. However, none of the five models met the criteria of ASHRAE Guideline 14.
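The error measures and the ASHRAE Guideline 14 check used in this comparison can be sketched in a few lines. The formulas below are the common textbook forms and are an assumption about the exact expressions used in the study; the function names and the toy series are hypothetical. The hourly limits (MBE within ±10%, Cv at most 30%) are those stated in the error analysis section.

```python
import math
from typing import Dict, Sequence


def evaluate(actual: Sequence[float], forecast: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, SMAPE (%), MBE (%), and Cv (%) for one model."""
    n = len(actual)
    if n == 0 or n != len(forecast):
        raise ValueError("actual and forecast must be non-empty and equally long")
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("MBE and Cv are undefined when all actual values are 0")

    errors = [a - f for a, f in zip(actual, forecast)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    smape_sum = 0.0
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2.0
        # Hours where both values are 0 (for example at night) count as zero error.
        smape_sum += 0.0 if denom == 0 else abs(f - a) / denom

    return {
        "MAE": mae,
        "RMSE": rmse,
        "SMAPE": 100.0 * smape_sum / n,
        "MBE": 100.0 * sum(errors) / total_actual,
        "Cv": 100.0 * rmse / (total_actual / n),
    }


def meets_ashrae_hourly(mbe: float, cv: float,
                        mbe_limit: float = 10.0, cv_limit: float = 30.0) -> bool:
    """Check the hourly criteria: |MBE| <= 10% and Cv <= 30%."""
    return abs(mbe) <= mbe_limit and cv <= cv_limit


# Hypothetical toy series, not data from the study.
scores = evaluate([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(scores, meets_ashrae_hourly(scores["MBE"], scores["Cv"]))
```

Applied to the values reported for the SARIMAX-LSTM model in the Conclusions (MBE: 2.650; Cv: 29.923), `meets_ashrae_hourly(2.650, 29.923)` returns `True`, with the Cv value only just inside the 30% limit.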

Discussion
The single-site solar PV generation prediction model has limitations when using multisite data. The ARIMAX model, which captures multivariate time-series characteristics in the single-site solar PV generation prediction model, and the SARIMAX model in the multisite solar PV generation prediction model show higher performance than the other models but do not fulfill the criteria of ASHRAE Guideline 14. The performance of the single-site model using the multisite data set is similar to that of the multisite model but is not optimal, because the single-site model cannot learn several factors, including the facility and geographic information of the solar PV power plants included in the multisite data. To improve the performance of the proposed model, the factors hindering prediction performance must be identified and addressed. The main inhibitory factor is deemed to be the missing values in the AMV data. In the preprocessing step, the wind direction arrow in the AMV image must be recognized before proceeding to the next step. If the entire AMV image contains no wind direction data in the ROI, there is no wind direction arrow, and the corresponding time slot is treated as a missing value. Therefore, if the number of missing values can be reduced by using various interpolation methods or by extracting satellite image data with other methods, the models could achieve better performance.
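As one illustration of the interpolation idea, short gaps in an hourly wind direction series could be filled along the shorter arc between the neighbouring known directions. This is a sketch of one possible approach and not a method evaluated in this study; the function names, the maximum gap length, and the representation of missing hours as `None` are assumptions.

```python
from typing import List, Optional


def interp_angle(a: float, b: float, t: float) -> float:
    """Interpolate from angle a to angle b (degrees) along the shorter arc."""
    diff = (b - a + 180.0) % 360.0 - 180.0  # signed difference in [-180, 180)
    return (a + t * diff) % 360.0


def fill_wind_direction(
    series: List[Optional[float]], max_gap: int = 3
) -> List[Optional[float]]:
    """Fill runs of missing hours no longer than max_gap; leave other gaps as None."""
    filled = list(series)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is not None:
            i += 1
            continue
        start = i
        while i < n and filled[i] is None:
            i += 1
        end = i  # first index after the gap
        gap = end - start
        # Fill only gaps bounded by known values on both sides.
        if start > 0 and end < n and gap <= max_gap:
            a, b = filled[start - 1], filled[end]
            for k in range(gap):
                filled[start + k] = interp_angle(a, b, (k + 1) / (gap + 1))
    return filled


# Hypothetical series: one missing hour between 350 and 10 degrees.
print(fill_wind_direction([350.0, None, 10.0]))
```

Treating the direction as a circular quantity matters here: a plain average of 350 and 10 degrees would give 180 degrees, the opposite direction.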

Conclusions
This study proposed an advanced multisite integrated hybrid spatio-temporal solar PV generation prediction model that combines time-series-based meteorological numerical text data with satellite image data containing spatial information, in order to develop a precise and accurate prediction model for solar PV power plants in multiple regions. The existing data provided by the KMA contain time-series characteristics but do not reflect the spatial characteristics of clouds and PM moving according to the wind direction. Therefore, data on clouds and PM moving according to the wind direction were extracted from satellite images so that the spatial characteristics are represented as well. The model predicted the solar PV generation of existing solar PV power plants in both a single region and other regions. Data from 2015 to 2018 were used for three solar PV power plants in Incheon, Busan, and Yeongam in South Korea. To reflect the spatial characteristics of clouds and PM, the data from 2015 to 2017 were first learned to predict the amount of clouds and PM in 2018, and the amount of solar PV generation in 2018 was then predicted using the predicted cloud and PM data. To develop the optimal prediction model, SARIMAX, a traditional time-series analysis method, and the SVR_Linear, DNN, LSTM, Random Forest, and SARIMAX-LSTM models based on machine learning algorithms were used.
Consequently, the overall performance increased compared to the single-site solar PV generation prediction model. For the SARIMAX-LSTM model, to which the stacking ensemble technique was applied to make the most of the temporal characteristics of the solar power generation data, the results were MAE: 64.730; RMSE: 95.800; SMAPE: 19.891; MBE: 2.650; and Cv: 29.923. Among the proposed models, it is the only one that satisfies ASHRAE Guideline 14, and it showed the best performance.
The proposed advanced multisite integrated hybrid spatio-temporal solar PV generation prediction model can predict integrated solar PV power generation for solar PV power plants in various regions in South Korea using numerical text data and satellite images. Therefore, it enables the prediction of solar PV generation for both existing and newly constructed solar PV power plants. By learning the facility and geographic information of each solar PV power plant, and the meteorological and air pollutant data of the area where the solar PV power plant is located, the amount of solar PV generation can be predicted. This reflects the spatio-temporal characteristics of solar PV generation, thereby providing guidelines for developing a precise and accurate solar PV generation prediction model for a stable power supply and demand plan.