Analysis of Meteorological Factor Multivariate Models for Medium- and Long-Term Photovoltaic Solar Power Forecasting Using Long Short-Term Memory

Abstract: Solar power generation is an increasingly popular renewable energy topic. Photovoltaic (PV) systems are installed on buildings to efficiently manage energy production and consumption. Because of its physical properties, electrical energy is produced and consumed simultaneously; therefore, solar energy must be predicted accurately to maintain a stable power supply. To develop an efficient energy management system (EMS), 22 multivariate numerical models were constructed by combining solar radiation, sunlight, humidity, temperature, cloud cover, and wind speed. The performance of the models was compared by applying a modified version of the traditional long short-term memory (LSTM) approach. The experimental results showed that the six meteorological factors influence the solar power forecast regardless of the season. These are, from most to least important: solar radiation, sunlight, wind speed, temperature, cloud cover, and humidity. The models are rated for suitability to provide medium- and long-term solar power forecasts, and the modified LSTM demonstrates better performance than the traditional LSTM.


Introduction
Recently, considerable research has been conducted on low pollution, renewable energy sources to address carbon emissions and environmental problems, including the Republic of Korea's "Implementation Plan for Renewable Energy 2030" [1]. This proposed strategy would increase the country's share of renewable energy from 7.6% (15.1 GW) in 2017 to 20% (63.8 GW) in 2030 by encouraging the use of clean energy such as solar, wind, hydropower, biofuels, and waste recycling. Among these energy sources, solar power generation is one of the most widespread renewable energy industries and has been used extensively as an alternative to existing power generation methods. The solar power supply was expected to account for 30% (5.7 GW) of the country's renewable energy in 2017, and supply 57% (36.5 GW) of renewables by 2030. Photovoltaic (PV) power generation is the most efficient among renewable energy resources for expanding small-scale, distributed power supplies, and it is expected to continue to grow because its power generation costs are decreasing the fastest [1,2]. Power supply utility companies operate large, centralized power plants to meet supply and demand for electricity. The utilities have generators to supply baseload power, and additional capacity to meet peaks in power demand. Most have the option to import additional production from other power companies if necessary, to meet high demand loads, resulting in some clear limitations. Some power plants can take a long time to come online at maximum production, while others are expensive to operate and are only used for peak-shaving. The demand on some plants can at times be greater than the total power plant output. Power supply companies try to prepare for surges and declines in electrical demand by controlling the energy production and transmission system through the use of predicted power consumption, and operating via demand response (DR) that helps to effectively reduce peaks [3,4]. 
The energy management system (EMS) controls energy in all areas of the power system, including power generation, transmission, distribution, and consumption. The EMS software has been adapted for buildings (BEMS), factories (FEMS), and homes (HEMS). The purpose of the EMS is to ensure efficiency in energy use and operation, although individual systems and details differ depending on the area of use. Until recently, electric power systems have consisted of centralized power generation and one-way, distributed consumption, with little more than experience and guesswork to manage and operate energy grids. The development of smart grids incorporating various types of EMS aids intelligent and efficient operation of power systems. The prediction of small-scale, distributed solar power generation that may feed into these electrical grids is very important [5,6].
For solar power forecasting approaches, different methodologies are preferred, depending on factors such as forecasting horizons, model inputs, and data characteristics. These methods can be broadly divided into three categories: physical, data-driven, and hybrid approaches.
(1) The physical approach forecasts solar power using mathematical modeling that takes into consideration weather data (air pressure, temperature, humidity, jet streams, etc.) and environmental characteristics (orography, topography, land use, etc.) [7-9]. This approach requires a large volume of data, because the accuracy of the prediction increases with the amount of data. As such, the predictive model is often difficult to construct, and the resulting model structure is complicated because it must deal mathematically with many variables; the calculation is correspondingly complex and requires a long computation time. The most widely used application of physical models is numerical weather prediction (NWP) [10-12], which is more suitable for long-term forecasting than for short-term and medium-term forecasting because of the large amount of computational resources it requires.
(2) The data-driven approach predicts solar power through pattern analysis, using a model trained on past data recorded at fixed time increments, such as 15 min, one hour, or one day. This method becomes better suited to predicting solar power over short horizons as the amount of past data increases. Data-driven approaches can be either statistical or artificial intelligence methods. Statistical methods can achieve high accuracy in short-term forecasting, but cannot accurately forecast long-term solar power due to the progressive accumulation of errors. An example of a statistical approach is the autoregressive integrated moving average (ARIMA) model, a time series analysis model [13] that has many disadvantages. Several approaches have been suggested to work around these limitations [14,15].
In contrast, artificial intelligence methods include neural networks [16], fuzzy inference [17], particle swarm optimization [18], genetic algorithms [19], support vector machines [20], and deep learning [21], the last of which utilizes long short-term memory [22], gated recurrent units [23], autoencoders [24], and convolutional neural networks [25]. Artificial intelligence approaches have superior performance for general purposes but are limited because the relationships between model elements cannot be accurately explained.
(3) Hybrid approaches produce predictions by applying a statistical technique after using a physical approach, thereby combining the advantages of both. Traditional physical approaches are more suitable for long-term forecasting because of their large scale, while data-driven approaches are accurate for short-term forecasting but accumulate errors in long-term forecasts. Traditional hybrid approaches suffer from many errors, and studies are underway to improve their predictions [26-29]. Recently, hybrid approaches have been developed by combining machine learning and fuzzy theory, machine learning and deep learning, and multiple types of deep learning [30-33].
The objectives of this study can be summarized as follows: (1) A total of 22 numerical models were constructed for correlation analysis to assess the influence of meteorological factors such as solar radiation (SR), sunlight, humidity, temperature, cloud cover (CC), and wind speed (WS) on PV solar power forecasting. (2) Traditional long short-term memory (LSTM) is limited when predicting solar power generation in the time step between predictions, because small inaccuracies from the previous step compound and continuously increase the prediction error. The goal of the modified LSTM was to obtain the observed value from the time interval between predictions, and use the observed value instead of the predicted value to update the network for solar power generation forecasts.
The correlation analysis between solar power and meteorological factors was verified using data collected from buildings located in Ansan (Location "A" in Figure 1), Gyeonggi-do, Korea, from April 2018 to March 2019 in one-hour increments. Weather data for accurate solar power forecasting would ideally be collected by attaching sensors to the building in Ansan; however, this building is located in an industrial complex, and its solar power installation is temporary and not suited to collecting weather data. Therefore, the meteorological data were collected from the closest meteorological observation point at Suwon (Location "B" in Figure 1).
The organization of this study is as follows: Section 2 describes the correlations between solar power and meteorological factors based on data from April 2018, explains recurrent neural networks (RNNs) and LSTM, which have superior performance among the artificial intelligence approaches, and introduces the limitations of the traditional LSTM. Section 3 describes the solar PV data set and the 22 models used to verify the proposed approach. Section 4 presents the results from the analysis of the monthly and annual average solar power calculations for each of the models, and Section 5 outlines findings, conclusions, and plans for future work. Abbreviations mentioned in this article are described in Table A1 of Appendix A.

Correlations between Solar Power and Meteorological Factors
Previous studies have indicated that the production of electricity from solar power plants is closely related to the amount of available sunlight [35,36]; in other words, the most important consideration for solar power generation is the amount of sunlight received by the PV panel. Therefore, more power is generated during the summer, with its long sunshine duration, than during the winter. There is also a close correlation between the amount of sunshine per day and dust pollution, such as that from yellow sand, and automobile pollution (smog). Solar power generation is therefore slightly higher in May, June, September, and October than in July and August, when the sunshine duration is long but the amount of power generated is reduced due to high dust and smog levels when temperatures exceed 30 °C.
Scatterplots of the correlations between the meteorological factors and solar power (SP) in April 2018 are shown in Figure 2, and a comparison of the performance of the correlations between SP and various meteorological factors is listed in Table 1, where the best values are bolded. The performance of these relationships is evaluated based on different statistical measures: the sum of squared error (SSE), correlation coefficient (R), coefficient of determination (R²), and root mean square error (RMSE). The correlation indicates the degree of association between the two variables. Values of R approaching +1 indicate strong positive correlations, those approaching −1 indicate strong negative correlations, and those approaching 0 indicate an almost negligible linear relationship. The smaller the SSE and RMSE, the better the estimated regression line. Table 2 shows statistical data for SP and the meteorological factors: the minimum, maximum, mean, median, standard deviation, and range. Among these, the standard deviation represents the spread of the data; the smaller the standard deviation, the closer the values are to the mean. The results show that the relationship between SP and SR is the strongest, followed by the relationships between SP and sunshine, humidity, temperature, WS, and CC.
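As an illustration of the four measures named above, the following minimal Python sketch fits a least-squares line to paired observations of one meteorological factor and SP, then reports SSE, R, R², and RMSE. The function name `linear_fit_stats` is hypothetical, and the RMSE here divides the SSE by the number of points n; some toolboxes divide by the residual degrees of freedom instead, so values may differ slightly from those in Table 1.

```python
import math


def linear_fit_stats(x, y):
    """Fit y = slope * x + intercept by least squares and return fit statistics.

    x: values of one meteorological factor (e.g. SR); y: solar power (SP).
    Assumption: RMSE = sqrt(SSE / n), i.e. no degrees-of-freedom correction.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    # Pearson correlation coefficient between x and y.
    r = sxy / math.sqrt(sxx * syy)

    return {
        "slope": slope,
        "intercept": intercept,
        "SSE": sse,
        "R": r,
        "R2": r ** 2,
        "RMSE": math.sqrt(sse / n),
    }
```

Applying the function once per factor and sorting the results by R² (or by RMSE) gives a ranking of the kind summarized in Table 1.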

PV System and Measured Method
A PV system is designed to supply usable SP by means of PV cells. The building is an H-beam structure with sandwich panels for the exterior walls, and the PV system capacity is 150 kW. Figure 3 shows the wiring diagram of the SP system. The solar panels are connected in series and parallel and use a string inverter configuration with three inverters (50 kW × 3). PV power is measured with a 3P/4W (three-phase, four-wire) electronic watt-hour meter. Measurements are taken every 15 min but were collected in units of 1 h in this study.

Recurrent Neural Networks
In machine learning, RNNs are mainly used for analysis of time series data and sequence data [36]. In a multilayer neural network, the input is propagated only in the output direction, and the nodes of the hidden layer cannot use past information (inputs); the RNN, however, has a circular internal structure, as shown in Figure 4, which allows past information to be used. The RNN receives and processes the elements constituting time series data or sequence data one at a time, creates a corresponding internal node according to the input time, and stores the information processed up to that point. Information stored in this way can be output through the output node. Because of these characteristics, RNNs have been used in fields such as speech recognition, language recognition, and handwriting recognition with good results [37].
In Figure 4, the weight between the input layer and the hidden layer is represented by U, the weight between the hidden layer and the output layer is represented by V, and the weight between the hidden layers is represented by W. Since RNNs share weights, they have the same weights U, V, and W at all times. Weight sharing improves prediction ability, saves computation time, and has an advantage when the number of inputs is variable:
$$h_t = \tanh(U x_t + W h_{t-1}) \tag{1}$$
$$o_t = \sigma(V h_t) \tag{2}$$
In Equations (1) and (2), $x_t$ is the value of the input layer at time $t$, $h_{t-1}$ is the value of the hidden layer at time $t-1$, and $o_t$ is the value of the output layer at time $t$. The activation functions used in $h_t$ and $o_t$ are the hyperbolic tangent function (tanh) and the sigmoid ($\sigma$) function, respectively.
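The recurrence described above can be sketched in a few lines of Python. This is a minimal illustration, assuming no bias terms and using plain lists instead of a numerical library; the helper names `matvec`, `sigmoid`, and `rnn_forward` are hypothetical. The same U, W, and V are reused at every time step, which is the weight sharing mentioned in the text.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    """Logistic sigmoid, used for the output layer."""
    return 1.0 / (1.0 + math.exp(-z))


def rnn_forward(xs, U, W, V, h0):
    """Run a simple RNN over the input sequence xs.

    U: input-to-hidden weights, W: hidden-to-hidden weights,
    V: hidden-to-output weights, h0: initial hidden state.
    Returns the list of outputs o_t and the final hidden state.
    """
    h = h0
    outputs = []
    for x in xs:
        # Hidden state: tanh of the current input plus the previous hidden state.
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, h))]
        h = [math.tanh(p) for p in pre]
        # Output: sigmoid of the projected hidden state.
        outputs.append([sigmoid(o) for o in matvec(V, h)])
    return outputs, h
```

Because the loop simply iterates over `xs`, the same function handles sequences of any length, which reflects the advantage for a variable number of inputs.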

Long Short-Term Memory
RNNs have the advantages described in the previous section as well as the following limitations. (1) As shown previously in Figure 4, multiple neural networks are connected, and the same weight is multiplied several times, resulting in the exploding gradient and vanishing gradient problems [38]. (2) RNNs reflect information from the past, but as time progresses the previous information arrives in a weakened state, making it difficult for the network to incorporate older data. (3) Finally, RNNs have difficulty distinguishing between information that should continue to be delivered and unnecessary data that should be deleted. The LSTM addresses the vanishing gradient problem and the information transmission problem with a new "gate" concept [39]; a gate decides whether to remember or delete the current data. Representative applications of LSTMs include handwriting recognition [40], handwriting generation [41], speech recognition [42], machine translation [43], and image captioning [44,45].
Figure 5 shows the structure of the LSTM. The interior of the LSTM block consists of a cyclic memory cell and three types of gates (input, forget, and output gates). The LSTM calculates the final output through a hidden layer like the RNN but controls the flow of information by appropriately using the three gates during the calculation of the hidden layer. The input gate controls the amount of the current input state ($c_t$) flowing to the current cell state ($s_t$). The forget gate controls the amount of the cell state before time $t$ ($s_{t-1}$) flowing to the cell state at time $t$ ($s_t$). The states of the LSTM cells are computed as follows in Equations (3)-(5) [46,47], written here in the standard LSTM form, where $g_t$, $f_t$, and $q_t$ are the input, forget, and output gates, respectively, and $\odot$ denotes element-wise multiplication:
$$s_t = g_t \odot c_t + f_t \odot s_{t-1} \tag{3}$$
$$h_t = q_t \odot \tanh(s_t) \tag{4}$$
$$o_t = V h_t \tag{5}$$
The input state $c_t$ is calculated by applying $U_c$, $W_c$, and $b_c$ to the input data at time $t$ and the hidden layer ($h_{t-1}$) at time $t-1$, respectively. The cell state $s_t$ combines the current information using $c_t$, $g_t$, and $f_t$. The hidden node ($h_t$) is calculated using $s_t$ and $q_t$. Finally, the current output state ($o_t$) is obtained by multiplying the hidden layer by V.
The gates and the input state are computed with Equations (6)-(9), written here in the standard LSTM form with the sigmoid function $\sigma$ and the hyperbolic tangent:
$$g_t = \sigma(U_g x_t + W_g h_{t-1} + b_g) \tag{6}$$
$$f_t = \sigma(U_f x_t + W_f h_{t-1} + b_f) \tag{7}$$
$$q_t = \sigma(U_q x_t + W_q h_{t-1} + b_q) \tag{8}$$
$$c_t = \tanh(U_c x_t + W_c h_{t-1} + b_c) \tag{9}$$
In Equations (6)-(9), $U_g$, $U_f$, $U_q$, $U_c$, $W_g$, $W_f$, $W_q$, and $W_c$ are the weights of the input value $x_t$ at time $t$ and the previous hidden layer ($h_{t-1}$) for each gate, respectively; $b_g$, $b_f$, $b_q$, and $b_c$ are the biases of $g_t$, $f_t$, $q_t$, and $c_t$, respectively.
Figure 6 shows the solar power forecast results using a traditional LSTM for January 2019. The traditional LSTM uses the previous predicted value at each time step when forecasting solar power; if that predicted value is incorrect, the error is carried into subsequent forecasts and the prediction error increases continuously. Figure 6a compares the observed and predicted values, and Figure 6b shows the prediction error between them. Therefore, since the observed value is important for the prediction, in this study a weight is applied to the observed value instead of the previous data, and the network state is then updated for the forecast.
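The gate and state updates described in this section can be sketched as a single LSTM time step in Python. This is a minimal illustration under stated assumptions: the gates use the sigmoid function and the input state uses tanh, as in the standard LSTM formulation, and the parameter dictionary `p` with keys such as `"U_g"` and `"b_c"` is a hypothetical naming scheme chosen to mirror the symbols in the text.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, s_prev, p):
    """One LSTM time step.

    x: input x_t, h_prev: hidden layer h_{t-1}, s_prev: cell state s_{t-1}.
    p: dict holding U_*, W_*, b_* for the names g, f, q, c, plus output weights V.
    Returns (o_t, h_t, s_t).
    """
    def affine(name):
        # U_name x_t + W_name h_{t-1} + b_name
        ux = matvec(p["U_" + name], x)
        wh = matvec(p["W_" + name], h_prev)
        return [a + b + c for a, b, c in zip(ux, wh, p["b_" + name])]

    g = [sigmoid(z) for z in affine("g")]    # input gate
    f = [sigmoid(z) for z in affine("f")]    # forget gate
    q = [sigmoid(z) for z in affine("q")]    # output gate
    c = [math.tanh(z) for z in affine("c")]  # current input state

    # Cell state: gated new information plus gated previous cell state.
    s = [gi * ci + fi * si for gi, ci, fi, si in zip(g, c, f, s_prev)]
    # Hidden node: output gate applied to the squashed cell state.
    h = [qi * math.tanh(si) for qi, si in zip(q, s)]
    # Output state: hidden layer multiplied by V.
    o = matvec(p["V"], h)
    return o, h, s
```

Calling `lstm_step` in a loop and passing the returned `h` and `s` into the next call reproduces the recurrent behavior; the forecasting strategies discussed in the text differ only in what is supplied as `x` at each step.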

Proposed Approaches
This section explains the test dataset of SP, proposes 22 models for SP and weather data, and describes the proposed method in detail.

Test Dataset
In this study, data were collected from a building located in Ansan, Gyeonggi-do, Korea. The collection period was from April 2018 to March 2019, and SP generation and meteorological data were collected every hour. Table 3 shows the monthly data collected and the total data collected for one year. Table 4 shows the 22 multivariate numerical models constructed by combining SP with the six meteorological elements that have the greatest effect on solar power: SR, sunlight, CC, WS, humidity, and temperature. In this study, performance tests were conducted on the traditional and proposed LSTMs with these 22 models.

Proposed Method
In the following sections, the pre-processing step is outlined for data standardization and division into training and test data. The proposed LSTM is then described, and a post-processing step is applied to the simulation results.

Data Pre-Processing
Data collected on a monthly and yearly basis are divided into training (80%) and test (20%) data. To prevent divergence during training, the training data were standardized as shown in Equation (10) so that the mean is 0 and the variance is 1. The test data were standardized as shown in Equation (11) at the time of prediction using the same parameters as the training data:
$$\mathrm{Train}_{std} = \frac{\mathrm{Train}_i - \mathrm{Train}_{mean}}{\mathrm{Train}_{sig}} \tag{10}$$
$$\mathrm{Test}_{std} = \frac{\mathrm{Test}_i - \mathrm{Test}_{mean}}{\mathrm{Test}_{sig}} \tag{11}$$
In Equation (10), $\mathrm{Train}_i$ is the training data, and $\mathrm{Train}_{mean}$, $\mathrm{Train}_{sig}$, and $\mathrm{Train}_{std}$ are the mean, standard deviation, and standardized values of the training data, respectively. In Equation (11), $\mathrm{Test}_i$ is the test data, and $\mathrm{Test}_{mean}$, $\mathrm{Test}_{sig}$, and $\mathrm{Test}_{std}$ are the mean, standard deviation, and standardized values of the test data, respectively.
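The pre-processing step can be sketched as follows. This is a minimal Python illustration, not the MATLAB code used in the study: the function names are hypothetical, the split is chronological (the first 80% of the series for training), and the standard deviation is the population form (division by n), all of which are assumptions. The mean and standard deviation are passed in explicitly so that either the training statistics or the test statistics can be applied to the test data; `destandardize` is the inverse mapping used when results are converted back to the original scale.

```python
import math


def split_train_test(series, train_fraction=0.8):
    """Split a time series chronologically into training and test parts."""
    n_train = int(len(series) * train_fraction)
    return series[:n_train], series[n_train:]


def mean_and_std(values):
    """Return the mean and the population standard deviation (assumption)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def standardize(values, mean, sig):
    """Map values to zero mean and unit variance for the given statistics."""
    return [(v - mean) / sig for v in values]


def destandardize(values, mean, sig):
    """Inverse of standardize: return values on the original scale."""
    return [sig * v + mean for v in values]
```

For example, `train, test = split_train_test(hourly_sp)` followed by `mu, sig = mean_and_std(train)` and `standardize(test, mu, sig)` standardizes the test data with the training parameters, where `hourly_sp` is a hypothetical list of hourly SP readings.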

Model Setting and Training
In the traditional LSTM, predictions made for earlier sequences should not affect the prediction of new data; therefore, the neural network state is reset before each sequence is predicted. Since the observed value of the time step between predictions can be accessed, the neural network state is updated using the observed value instead of the predicted value, following Equation (12). In Equation (12), since the previous cell state and forget gate did not reflect changes to the current input data well, many errors occurred. Therefore, the weight (α = 0.8) is applied to $g_t$ and $c_t$ rather than to the previous cell state and $f_t$; this is expected to improve performance because information from the current data is not lost.
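The difference between feeding predictions back into the network and updating it with observed values can be illustrated with a toy example. The Python sketch below is not the proposed LSTM and does not implement Equation (12); `toy_step` is a hypothetical stand-in for a trained network that overshoots by 10%, chosen only to show how errors compound when predictions are fed back and stay bounded when observed values are used.

```python
def forecast_closed_loop(step, state, first_input, n_steps):
    """Forecast n_steps ahead, feeding each prediction back as the next input."""
    predictions = []
    x = first_input
    for _ in range(n_steps):
        y, state = step(x, state)
        predictions.append(y)
        x = y  # the next input is the model's own prediction
    return predictions


def forecast_with_observations(step, state, observed):
    """Predict each next value, then update with the observed value instead."""
    predictions = []
    for x in observed[:-1]:
        y, state = step(x, state)  # the input is always an observed value
        predictions.append(y)
    return predictions


def toy_step(x, state):
    """Hypothetical one-step predictor that overshoots the input by 10%."""
    return 1.1 * x, state


# Hypothetical constant series, so any growth in the forecast is pure error.
observed = [10.0] * 5
closed = forecast_closed_loop(toy_step, None, observed[0], 4)
updated = forecast_with_observations(toy_step, None, observed)

closed_err = [abs(p - o) for p, o in zip(closed, observed[1:])]
updated_err = [abs(p - o) for p, o in zip(updated, observed[1:])]
```

In this toy setting the closed-loop error grows at every step, while the error of the observation-updated forecast stays at the one-step bias of the predictor, which mirrors the motivation given above for using observed values.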
For training and testing, the model was set up as follows because it has excellent performance: the hidden LSTM layer consisted of 200 blocks, the maximum number of iterations was fixed at 250, the initial learning rate was 0.005, the activation function was the ReLU (rectified linear unit) [48], and the optimization was "Adam" [49].

Post-Processing
Standardization was used in the pre-processing stage to prevent divergence of the training and test data; in post-processing, the standardized training and test data are converted back to the original scale using Equations (13) and (14), respectively:
$$\mathrm{Train}_{pred} = \mathrm{Train}_{sig} \times \mathrm{Train}_{pred} + \mathrm{Train}_{mean} \tag{13}$$
$$\mathrm{Test}_{pred} = \mathrm{Test}_{sig} \times \mathrm{Test}_{pred} + \mathrm{Test}_{mean} \tag{14}$$
Figure 7 shows the solar power forecast results using the proposed LSTM for January 2019. Figure 7a compares the observed and predicted values, and Figure 7b shows the prediction error between them. As shown in Figure 7, the proposed method is more accurate than the traditional LSTM used in the experiment. The proposed LSTM can adapt to fluctuations in the observed data, which improves its accuracy and performance. Additionally, the prediction error of the traditional LSTM is 6.68, while the prediction error of the proposed LSTM is 2.07. Therefore, the proposed LSTM method is more effective and adaptive for solar power forecasting.

Test Environment and Metrics for Evaluation
To verify the traditional and proposed methods, experiments were performed on a PC equipped with an Intel Xeon(R) E-2136 3.31 GHz CPU and 32 GB RAM. The test operating system was Windows 10 (64 bit), and the experiments were implemented with the Deep Learning Toolbox and the Statistics and Machine Learning Toolbox of MATLAB R2019a [50]. Finally, the MAE (mean absolute error), RMSE, and computation time (ms) were adopted to evaluate the solar power forecasting error; the MAE and RMSE are defined in Equations (15) and (16):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{15}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{16}$$
where $y_i$ and $\hat{y}_i$ indicate the observed and predicted values, respectively, and $n$ is the number of test data points.
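The two error metrics can be computed directly from paired observed and predicted values. The following Python sketch uses the usual definitions of MAE and RMSE; the function names `mae` and `rmse` are hypothetical and independent of the MATLAB code used in the study.

```python
import math


def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    n = len(observed)
    return sum(abs(y - y_hat) for y, y_hat in zip(observed, predicted)) / n


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / n)
```

Because the RMSE squares each residual before averaging, it penalizes occasional large errors more heavily than the MAE, so the two values are worth reading together when comparing models.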

Performance Comparison between the Monthly and Annual Averages
The weather in Korea 30 years ago was less unusual than it is now, with spring from March to May, summer from June to August, autumn from September to November, and winter from December to February. However, Korean weather patterns have now compressed spring to April and May and autumn to October and November, while the summer season (June to September) and winter season (December to March) have become longer. As shown in Figure 8, the impact of the meteorological factors on solar power, from greatest to least, was SR > sunlight > WS > temperature > CC > humidity [51]. Thus, while five of the meteorological factors had a substantial impact on solar power generation, humidity did not significantly affect the month-to-month solar power generation forecast. Table 5 shows the results of the proposed LSTM method for the multivariate models combining the six considered meteorological factors (with RMSE < 1.5). Among the 22 multivariate models, model 16 was the best at accurately predicting solar power regardless of the season. The input parameters of model 16 are SP, SR, sunlight, and WS. Model 1 has the advantage of a short computation time because it has only a few input parameters; however, since the model learns only from previous SP values, it cannot accurately predict solar power. As shown in Table 5, humidity and WS do not significantly affect the solar power forecasts individually; however, models 19, 21, and 22 have smaller RMSE values owing to the correlation between humidity and wind speed. The results in Table 5 indicate that the models cannot accurately predict SP given the seasonal weather conditions in Korea, but model 16 is better suited to long-term solar power forecasting. The models with more checkmarks in the columns of Table 5 are better for medium-term solar power forecasting, depending on the regional and climatic conditions.
Tables 6 and 7 compare the performance of the traditional LSTM and the proposed LSTM, by model, for medium-term and long-term solar power forecasting, respectively. Table 6 compares the performance of medium-term solar power forecasting for April, which represents the spring season in Korea. ∆ is the difference between the traditional LSTM and the proposed LSTM. The medium-term solar power forecasting results for August, October, and January, included in Tables A2-A4 of Appendix B, represent the Korean summer, autumn, and winter, respectively.
Figure 11 shows the best and worst models for the proposed and traditional LSTMs by month (January, April, August, and October). When compared seasonally, models 2 and 6 show the best performance in the traditional and proposed methods, whereas models 14, 17, and 19 have the worst performance. Figure 12 compares the RMSE values of the monthly and yearly averages determined by the models: Figure 12a compares the monthly and yearly averages by model, and Figure 12b shows the difference between them. As shown in Figure 12a, the RMSE values of the monthly and yearly averages are very similar; however, models 1, 5, 6, 8, and 10, which present a yearly difference of more than 0.5 compared to the monthly average, should not be used, especially in long-term solar power forecasting. Figure 13a compares the computation times for the monthly and yearly averages for each model, and Figure 13b shows the difference between them. The total amount of data across the monthly experiments is the same as in the yearly experiment, but the computation time for the yearly average is longer than that for the monthly average. This is because the amount of data processed in a single yearly experiment is larger than that in any single monthly experiment, requiring a longer computation time. Additionally, the computation time increases as the number of input parameters increases.

Conclusions
Among the various renewable energy sources, solar energy is being increasingly supported by the Korean government. Solar power systems have been installed in numerous locations such as homes, buildings, and factories, and electricity has been supplied accordingly. Forecasting solar power is essential for a smooth, reliable power supply; however, these forecasts are strongly influenced by meteorological factors. This study analyzed six meteorological factors with the goal of accurately predicting solar power. A total of 22 multivariate models were proposed to assess various combinations of the six meteorological factors: solar radiation (SR), sunlight, humidity, temperature, cloud cover (CC), and wind speed (WS). Analysis of the multivariate numerical models included modifications of the traditional long short-term memory (LSTM) method and application of the proposed LSTM.
The main results are as follows. (1) The six meteorological factors affect the solar power forecast in the following descending order: SR, sunlight, WS, temperature, CC, and humidity. SR has the greatest influence on solar power forecasting, while humidity has effectively no influence on the solar power forecast. (2) Among the 22 models, model 16 is superior, regardless of the characteristics of abnormal weather in Korea. The meteorological factors incorporated in model 16 are SP, SR, sunlight, and WS. (3) The remaining models were ranked according to medium- and long-term solar power forecasting accuracy. Models 6, 7, 9, 13, and 19 showed reasonable performance for long-term forecasting, whereas models 1, 5, 8, and 10 were not suitable. All models except 1, 5, 8, and 10 are appropriate for medium-term solar power forecasting. (4) The root mean square error and the mean absolute error of the proposed LSTM are superior to those of the traditional LSTM. (5) The calculation time increases as the number of variables in a model increases. (6) The proposed medium- to long-term photovoltaic (PV) solar power forecasting contributes to efficient power consumption, demand response, and a smooth power supply for buildings.
The limitations of the article are as follows. (1) Attaching weather sensors (temperature, humidity, WS, SR, etc.) to the solar installation site to acquire meteorological and solar power data simultaneously would improve the accuracy of the solar power forecasts. However, this research was limited to collecting meteorological data at an established government weather station some distance from the site. (2) The proposed algorithm has been applied to the various sites for less than a year; consequently, seasonal measurements could not be repeated.
Future work will collect meteorological data simultaneously at the site where PV is installed, and measurements will extend for more than a year. The methodology will also be applied to other solar power plants to verify the proposed model.

Acknowledgments: This article was supported by a research fund from Honam University, 2020.

Conflicts of Interest:
The authors declare no conflict of interest.