Weather Data Mixing Models for Day-Ahead PV Forecasting in Small-Scale PV Plants

As a large number of small-scale PV plants have been deployed in distribution systems, generation forecasting for such plants has been gaining interest. Because PV power depends mainly on weather conditions, accurate weather data for the relevant PV sites are important for enhancing PV forecasting accuracy. However, small-scale PV plants often lack their own measuring apparatus for recording historical weather data, so they rely on weather datasets from relatively nearby weather data centers (WDCs). As a result, these plants have difficulty achieving robust and reliable forecasting accuracy, because weather data predicted for a distant location may not represent the conditions at the PV site. In this paper, two weather data mixing models are proposed: (a) inverse distance weighting (IDW), and (b) inverse correlation weighting (ICW). These models produce mixed weather data suited to day-ahead generation forecasting for small-scale PV plants. The weather data are mixed using either the geographic distance between the PV site and the WDCs, or the correlation between the PV generation and the weather variables from nearby WDCs. Interestingly, the proposed ICW model performs better when the WDCs are located far from the PV plant, whereas IDW performs well with closer WDCs. The forecasting performance of the proposed mixing models was compared with that of the existing weather data collection methods.


Introduction
Distributed energy resources, such as photovoltaic (PV) power generation, wind power generation, and energy storage devices, are essential components of a modern smart grid. Among them, PV power generation is increasing sharply worldwide because of its convenient installation, the growing demand for clean energy, and other priorities set by individual countries [1]. To meet the large demand for clean energy, both large-scale and small-scale PV plants need to be installed in various geographical regions [2]. Large-scale PV plants usually have their own meteorological offices that record historical weather data for short-term PV forecasting (STPF) operations. However, small-scale PV plants are generally less well funded and may not have facilities for weather data collection. For their STPF operation, a weather dataset from the closest weather data center (WDC) is often used, even though the distance means that the meteorological conditions at the WDC often differ from those at the plant.
To date, many approaches have been proposed for STPF, many of which depend on predicted weather conditions [3]. Overall, these approaches can be categorized as (a) statistical, (b) learning-based, and (c) hybrid. Statistical approaches, such as the ordinary least squares model [4], vector auto-regression model [4], gradient boosting [4], sine cosine algorithm [5], extreme learning machine [5,6], and empirical mode decomposition [6], perform predictions using only historical PV data. These models assume that the PV series behaves as a nonstationary time series, and the predicted PV generation is based simply on historical observations and seasonality modeling [7]. Learning-based methods are gaining more interest because of their simple implementation and good forecasting accuracy. These include both machine and deep learning algorithms, such as support vector regression [8], back-propagation neural network (BPNN) [9][10][11], long short-term memory (LSTM) networks [12][13][14], and convolutional neural networks (CNNs) [15]. BPNN, LSTM, and CNN are state-of-the-art learning algorithms that are convenient to implement in the PV forecasting sector. However, they achieve higher forecasting accuracy only when the training set of each model comprises either a sufficient amount of data or homogenous input data. Because many small-scale PV plants have been installed only recently, collecting a large amount of historical PV data is a major issue that needs to be resolved. Consequently, recent STPF methods follow a hybrid mechanism to obtain better PV forecasting outputs.
Hybrid PV forecasting methods mainly focus on either specific or similar day identification [16][17][18][19][20][21][22]. Euclidean distance (ED)-based optimization [16], statistical analysis software (SPSS) [17], discrete wavelet transformation (DWT) [19], radial basis function [20], and wavelet packet distribution (WPD) [21] are the statistical techniques used for the determination of a specific type of day from historical days. Moreover, learning-based clustering techniques, such as support vector machines (SVMs) [19], k-nearest neighbors (k-NN) [17], and self-organizing map classification [14,22], are the major classification approaches that extract similar weather days. A strong similar day detection (SDD) method that deals with different impacts of weather variables for collecting homogenous PV generation profiles was developed [14]. The common aim of these hybrid techniques is to determine a similar day from the historical data for better forecasting outputs.
Based on the literature review, many PV forecasting approaches use predicted weather data obtained either from the same location or from the closest WDC. Dependence on the closest WDC is one of the major causes of higher forecasting errors, particularly in small-scale PV plants. To overcome this issue, the weather information from all accessible WDCs needs to be exploited as fully as possible. These WDCs may be deployed in various geographical regions within a certain distance. In [23], passing cloud issues were introduced and resolved based on the geographical region for multiplant PV forecasting. In [24], wind speed was used to estimate the spatial correlation for modeling PV plants in new locations. However, unusual geographical structures and different weather conditions may occur within this distance, so the weather data collected at a distant WDC might not adequately represent the conditions at the plant. Cloud movement and wind activity vary with distance, particularly in hilly or mountainous areas. The optimum tilt angle of the solar panel also has a significant impact on PV generation within a city area [25]. In addition, elevation and humidity often affect PV generation, and they differ from place to place [26]. Consequently, a novel methodology needs to be developed to overcome this problem when collecting weather data from distant WDCs for small-scale PV plants.
The objective of this study was to investigate the feasibility of attaining a higher forecasting accuracy for small-scale PV plants by collecting a more reliable weather dataset from all the accessible WDCs. In this study, two weather data mixing models were proposed to solve the distance problem while collecting weather data from WDCs. These mixing models include the (1) inverse distance weighting (IDW) model, and (2) inverse correlation weighting (ICW) model, and they incorporate a computation of the mixing of weather data from all the reachable WDCs. The proposed IDW model collects the potential weightage weather data from all the WDCs. This weightage is calculated by inverting the distance between the PV plant and WDC. The proposed ICW model computes the weightage for all the WDCs that is similar to the IDW model, where ensemble correlation is used instead of distance. Two existing weather calibration methods are used for comparison: (1) raw data from the closest WDC, and (2) average weather data. Although the averaged weather data utilizes the weather data from all the accessible WDCs, it is considered to be a conventional method.
After the weather data were collected using each of the methods, the existing SDD-based day-ahead PV forecasting technique [14] was used for comparison. The SDD outputs four new PV series for the training process, one for each weather data collection method. Each new PV series is assumed to be homogenous and is passed through the LSTM-based forecasting network. In this study, four PV plants deployed in different geographical regions were used to compare the performances. Because distance affects the weather data collection, only WDCs within a confined distance are selected as accessible. The simulation results verify that the proposed IDW and ICW models collect useful weather data and enhance the forecasting output for all the PV plants. These models are particularly relevant for new and old small-scale PV plants that are installed in remote and diverse geographical regions.
The rest of the paper is organized as follows. Section 2 discusses the issues in the existing weather data collection for the STPF technique. Section 3 explains the proposed weather data mixing models, along with the existing SDD-based PV forecasting algorithm. Section 4 discusses the PV plants and weather data, hyperparameter tuning, and simulation results. Finally, Section 5 concludes the paper.

Weather Data Collection Problem
Despite the impact of various factors, including the formation of clouds from dense water vapor, passing clouds and wind activity are often considered trivial causes of PV fluctuations. However, these issues play a key role when there is a significant distance between the PV plants and the WDCs. WDCs are mostly located in and around city areas because of higher public activity and cost factors. Because the earth's surface and weather conditions are spatially variable, clouds and wind are easily blocked or diverted. Consequently, weather data collected from the closest WDC can be unsuitable for the forecasting task. If hills and mountains lie between the WDCs and the PV plant, the collected weather data are considered to be inadequate. Figure 1a,b show the conventional and proposed STPF approaches for the small-scale PV plant, respectively. In Figure 1a, the weather data from the closest WDC are used directly for the STPF task. In Figure 1b, weather data from four accessible WDCs are mixed to obtain better STPF results for the small-scale PV plant.
Although the amount of solar radiation is the most important parameter for PV prediction, it is not directly available from agencies [26]. Other weather parameters, such as the cloud cover, sunshine hours, humidity, and wind, also provide significant information about PV generation. For example, differences in cloud movement over PV plants are a main source of uncertainty in the PV generation output. Because the impact of these parameters varies between places, directly selecting the closest WDC for weather data collection is problematic. The other accessible WDCs may also carry useful information. Therefore, considering the weather data from all the accessible WDCs could be effective for STPF operation in small-scale PV plants.
The larger the distance between the PV plant and WDC, the higher the probability of obtaining inferior weather data. Therefore, the collected weather dataset cannot effectively address the latent relationship between PV generation and weather parameters. Figure 2 depicts the average correlation measure between the generation of PV power and the cloud cover of each of the four WDCs. The nearest data center, WDC 1, and the most remote data center, WDC 4, are located at 6.01 km and 40.9 km away from the PV plant, respectively. In Figure 2, WDC 1 exhibits a better correlation measure only in the months of August and October. However, WDC 2 and WDC 3 demonstrate better results in April and December, respectively. In addition, WDC 4 presents comparable results with respect to WDC 1, for April, August, and December.
From Figure 2, none of the WDCs explicitly reveal a better correlation measurement, although cloud cover is the paramount weather parameter in PV forecasting. In addition to the cloud cover uncertainty, the distance has a significant impact on the relationship between the weather parameters and PV generation. Furthermore, the correlation values of the same WDC varied monthly. It is obvious that there is a problem in accepting weather data only from the closest WDC. Therefore, exploiting meaningful information from all the accessible weather data is one of the major approaches considered to enhance the reliability of the collected weather data. To obtain sufficient information, the inherent problem related to distance and weather data computation needs to be acknowledged.

Weather Data Mixing Models
Let the historical PV generation of any small-scale PV plant, $P$, be expressed as follows:
$$P = \{p_1, p_2, \ldots, p_d, \ldots, p_D\}$$
where $p_d$ is the PV generation of day $d$ and $D$ is the number of historical days.
The collection of raw weather data from the closest WDC is an existing method of weather data collection for small-scale PV plants. The reliability of the collected weather data depends on the distance between the PV plant and the closest WDC. The closest WDC model collects the weather data, $W^{CL}_d$, for day $d$, based on the shortest distance. The minimum distance, $\lambda_m$, is selected from $\lambda_1, \lambda_2, \ldots, \lambda_m, \ldots, \lambda_M$, which are the distances between the PV plant and the $M$ different WDCs. The weather data of WDC $m$, the WDC at the shortest distance, are then used directly as $W^{CL}_d$.

Averaged Weather Data
Another existing approach through which the weather data can be easily calculated for a small-scale PV plant is the averaging of all the accessible weather data. The average weather data, $W^{AVG}_d$, for day $d$ was calculated as follows:
$$W^{AVG}_d = \frac{1}{M} \sum_{m=1}^{M} W^{m}_d$$
where $W^{m}_d$ is the weather data of WDC $m$ for day $d$. Both $W^{CL}_d$ and $W^{AVG}_d$ provide adequate weather data when all the available WDCs are deployed within a few kilometers and the region is either flat or features only small hills. However, geographical structures such as higher hills and mountains usually alter the movement of clouds, wind phenomena, and humidity observations. Consequently, the collected weather data contain a large amount of distortion and do not yield a significant improvement in the forecasting performance.
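The two existing collection methods described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the list-based data layout, and the numeric values (apart from the 6.01 km and 40.9 km distances quoted in the text) are assumptions made for the example.

```python
def closest_wdc_weather(distances_km, weather_by_wdc):
    """Return the weather vector of the WDC with the smallest distance (W^CL_d)."""
    m = min(range(len(distances_km)), key=lambda i: distances_km[i])
    return list(weather_by_wdc[m])


def averaged_weather(weather_by_wdc):
    """Return the element-wise mean over all M WDCs (W^AVG_d)."""
    M = len(weather_by_wdc)
    n_vars = len(weather_by_wdc[0])
    return [sum(w[n] for w in weather_by_wdc) / M for n in range(n_vars)]


# Hypothetical example: four WDCs, two normalized weather variables
# (cloud cover, humidity) for a single day d.
distances_km = [6.01, 18.0, 27.5, 40.9]   # middle two values are made up
weather_day_d = [
    [0.20, 0.60],
    [0.40, 0.50],
    [0.60, 0.40],
    [0.80, 0.30],
]

w_cl = closest_wdc_weather(distances_km, weather_day_d)
w_avg = averaged_weather(weather_day_d)
print(w_cl, w_avg)
```

The closest-WDC method ignores three of the four stations, while the average treats a station 6 km away exactly like one 40 km away; the mixing models in the following subsections sit between these two extremes.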

Inverse Distance Weighting (IDW) Model
The IDW model is a deterministic spatial interpolation approach that is comparatively fast and straightforward [27]. The IDW equation is mainly applied in distance-based interpolation problems, in fields such as wireless sensor networks [28], geographic information systems (GIS) [29], and computer science [30]. Here, it facilitates the collection of valid weather data for a small-scale PV plant from distant WDCs. The method assigns a weight to each WDC based on the inverse of its distance, so a higher weight implies a shorter distance between the PV plant and the WDC, and vice versa. The IDW model that provides the weather data, $W^{IDW}_d$, for day $d$ is defined as:
$$W^{IDW}_d = \frac{\sum_{m=1}^{M} \lambda_m^{-1} \, W^{m}_d}{\sum_{m=1}^{M} \lambda_m^{-1}}$$
where $W^{m}_d$ is the weather data of WDC $m$ for day $d$ and $\lambda_m$ is the distance between the PV plant and WDC $m$. Because an increase in distance results in a decrease in the inverse distance, the closest WDC has a higher impact on $W^{IDW}_d$ than the remote WDCs. Although $W^{IDW}_d$ captures meaningful weather information based on the distance, it does not account for the inherent relationship between PV generation and weather variables. In addition, when all the WDCs are located far away, the collected $W^{IDW}_d$ performs weakly. Based on these weather data alone, the varying impact of the weather parameters on PV generation is difficult to capture.
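As a rough sketch of the inverse distance weighting described above, the function below normalizes the inverse distances so that the weights sum to one and then mixes the per-WDC weather vectors. The normalization, the function name `idw_mix`, and the sample distances and values are assumptions for illustration; the paper may define the weights with additional details not recoverable here.

```python
def idw_mix(distances_km, weather_by_wdc):
    """Mix per-WDC weather vectors using normalized inverse-distance weights.

    Returns (mixed_vector, weights). Assumes every distance is strictly positive.
    """
    if len(distances_km) != len(weather_by_wdc):
        raise ValueError("one distance is required per WDC")
    if any(d <= 0 for d in distances_km):
        raise ValueError("distances must be strictly positive")

    inverse = [1.0 / d for d in distances_km]
    total = sum(inverse)
    weights = [v / total for v in inverse]          # weights sum to 1

    n_vars = len(weather_by_wdc[0])
    mixed = [
        sum(wt * w[n] for wt, w in zip(weights, weather_by_wdc))
        for n in range(n_vars)
    ]
    return mixed, weights


# Hypothetical example: three WDCs at 10, 20, and 40 km reporting cloud cover.
mixed, weights = idw_mix([10.0, 20.0, 40.0], [[0.70], [0.35], [0.00]])
print(weights)   # the closest WDC receives the largest weight
print(mixed)
```

In this example the weights are in the ratio 4:2:1, so halving the distance doubles a station's influence on the mixed value.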

Inverse Correlation Weighting (ICW) Model
The proposed ICW model assumes that the meteorological weather parameters have latent information for weather data calibration. The considerably more important latent information may vary among the weather data of each WDC. This varying impact can be captured through the correlation measures between the PV generation data and weather data.
Because this study considers multiple WDCs for a single PV plant, each WDC has a different correlation measure between its historical weather data and the PV generation data. For any WDC $m$, the proposed ICW model evaluates the correlation measure $\rho^{m}_{d,n}$ of weather variable $n$ for day $d$ (Equation (4)), which is computed between $p_d$, the daily average of PV generation for day $d$, and $W^{m}_{d,n}$, the daily average value of weather variable $n$ for day $d$. Equation (4) shows the diversity in the correlation values that are calculated between the PV generation data and the weather data, depending on the weather variable $n$. The correlation values range from −1 to 1. The correlation information from all the weather variables is then combined into an ensemble, $\rho^{m}_d$, which is the ensemble correlation measure of WDC $m$ for day $d$. Unlike the distance, $\rho^{m}_d$ varies only slightly among the $M$ WDCs. Because the impact of the various weather-related factors changes day by day, the ensemble correlation measure also changes. To handle this complexity in the correlation information, the inverse value of $\rho^{m}_d$ is used for mixing the weather data for day $d$, in the same way that the inverse distance is used in the IDW model:
$$W^{ICW}_d = \frac{\sum_{m=1}^{M} \left(\rho^{m}_d\right)^{-1} W^{m}_d}{\sum_{m=1}^{M} \left(\rho^{m}_d\right)^{-1}}$$
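The correlation-based mixing can be sketched as follows. Several details are assumptions rather than the paper's definitions: the per-variable measure is taken to be the Pearson correlation over a window of daily averages, the ensemble is taken to be the mean absolute correlation across variables, and a small floor `eps` guards against division by zero. Only the final step, weighting each WDC by the inverse of its ensemble correlation, follows the description in the text directly.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def ensemble_correlation(pv_daily, weather_series_by_var):
    """ASSUMPTION: ensemble = mean absolute Pearson correlation over all variables."""
    rhos = [abs(pearson(pv_daily, series)) for series in weather_series_by_var]
    return sum(rhos) / len(rhos)


def icw_mix(ensemble_rhos, weather_by_wdc, eps=1e-6):
    """Mix per-WDC weather vectors with normalized inverse-correlation weights."""
    inverse = [1.0 / max(r, eps) for r in ensemble_rhos]   # eps is an assumption
    total = sum(inverse)
    weights = [v / total for v in inverse]
    n_vars = len(weather_by_wdc[0])
    mixed = [
        sum(wt * w[n] for wt, w in zip(weights, weather_by_wdc))
        for n in range(n_vars)
    ]
    return mixed, weights


# Hypothetical example: two WDCs with ensemble correlations 0.5 and 0.25.
mixed, weights = icw_mix([0.5, 0.25], [[0.30], [0.60]])
print(weights, mixed)
```

Note that, as with the inverse distance, the inverse operation gives the larger weight to the WDC with the smaller ensemble value; whether the ensemble in the paper is signed or unsigned determines how this should be interpreted.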

Similar Day Detection (SDD)
A robust SDD method was previously proposed by the authors' research group [14]. The SDD was developed based on a clustering algorithm [31]. Initially, employing this SDD method, the historical PV generation profiles were classified into K-different PV groups. Thereafter, each weather dataset from each model is separately passed to the SDD to detect similar day group k, for the next day.
Based on the identified similar day group $k$, the weather data belonging to similar weather group $k$ are detected from the passed weather dataset. The weather dataset is initially normalized because each weather variable has a different range. Within similar weather group $k$, the numerical weather predicted value $w^{n}_{k,d}$ of each weather variable $n$ for day $d$ is determined. The values $w^{n}_{k,d}$ of weather variable $n$ within a similar weather group $k$ have an average numerical weather predicted value $\bar{w}^{n}_{k}$. Each weather variable $n$ shows a deviation among these group averages. This deviation, $r_n$, is the difference between the maximum and minimum average numerical weather predicted values $\bar{w}^{n}_{k}$, as follows:
$$r_n = \max_{k} \bar{w}^{n}_{k} - \min_{k} \bar{w}^{n}_{k}$$
where $n$ represents the weather variable and $k$ represents the similar weather group.
There is a vast literature on the impact of weather variables on PV generation [7,17,32]. Although there is a complex relationship between them, the input weather variables were directly categorized into two parts: primary weather variables (PWVs) and secondary weather variables (SWVs) [14]. PWVs are highly correlated weather variables for PV generation. However, SWVs are useful for acquiring only homogenous PV generation profiles within an identified group, k.
Weather variables with a higher deviation, $r_n$, are selected as PWVs by defining a threshold $\alpha$ [14]. Within a similar weather group $k$, the deviation $s^{n}_{k}$ is the difference between the maximum and minimum numerical weather predicted values $w^{n}_{k,d}$, as follows:
$$s^{n}_{k} = \max_{d} w^{n}_{k,d} - \min_{d} w^{n}_{k,d}$$
where $n$ represents the weather variable, and $w^{n}_{k,d}$ is the numerically predicted value in the similar weather group $k$ for day $d$. The weather variables that have a higher SWV deviation $s^{n}_{k}$ than the SWV threshold $\beta$ are selected as SWVs [14]. To generate a new homogenous PV series $P_{in}$ for the next-day PV forecasting, the PWVs that have a higher deviation $r_n$ are used to identify the similar weather group $k$ through an argmin over the weather groups, in which $w^{n}_{D+1}$ is the numerical predicted weather value of the weather variable $n$ for the next day $D + 1$. To select the most similar days from the similar weather group $k$, an SWV similarity index $\varphi_{k,d}$ is calculated for each day $d$ within the weather group $k$. A lower value of $\varphi_{k,d}$ indicates a more similar day, and such days are selected by defining a constant threshold $\gamma$ [14]. The PV generation profiles corresponding to these days are collected in $P_{in}$. Consequently, $P_{in}$ largely resolves the homogenous data requirement in PV forecasting.
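The deviation measures and the group identification step can be illustrated with a short sketch. The deviations follow the max-minus-min definitions given in the text; the group identification uses the sum of absolute differences between the next-day PWV values and the group averages, which is an assumed distance (the exact argmin expression is defined in [14]). The function names, the dictionary layout, and all numeric values are hypothetical.

```python
def group_means(groups):
    """groups: {k: [day_vector, ...]} -> {k: per-variable average over the days}."""
    means = {}
    for k, days in groups.items():
        n_vars = len(days[0])
        means[k] = [sum(day[n] for day in days) / len(days) for n in range(n_vars)]
    return means


def pwv_deviation(means):
    """r_n: max minus min of the group averages, per weather variable n."""
    n_vars = len(next(iter(means.values())))
    return [
        max(m[n] for m in means.values()) - min(m[n] for m in means.values())
        for n in range(n_vars)
    ]


def swv_deviation(days):
    """s^n_k: max minus min of the daily values inside one group, per variable n."""
    n_vars = len(days[0])
    return [max(d[n] for d in days) - min(d[n] for d in days) for n in range(n_vars)]


def identify_group(next_day, means, pwv_idx):
    """ASSUMPTION: pick the group whose PWV averages are closest in absolute difference."""
    return min(
        means,
        key=lambda k: sum(abs(next_day[n] - means[k][n]) for n in pwv_idx),
    )


# Hypothetical normalized data: two groups, two weather variables.
groups = {
    0: [[0.10, 0.50], [0.20, 0.60]],
    1: [[0.80, 0.50], [0.90, 0.40]],
}
alpha = 0.3                                    # illustrative PWV threshold
means = group_means(groups)
r = pwv_deviation(means)
pwv_idx = [n for n, value in enumerate(r) if value > alpha]
k_star = identify_group([0.75, 0.50], means, pwv_idx)
print(r, pwv_idx, k_star)
```

Here only the first variable separates the groups strongly enough to count as a PWV, and the next-day forecast falls into group 1.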

Proposed Weather Data Mixing Model-Based PV Forecasting Framework
The raw weather data from the closest WDC, the average weather data, and the mixed weather data obtained from the two proposed models may differ from one another, even though the same SDD is applied. Consequently, four different homogenous PV series, $P^{CL}_{in}$, $P^{AVG}_{in}$, $P^{IDW}_{in}$, and $P^{ICW}_{in}$, were obtained from the SDD. These series contain different numbers of homogenous PV generation profiles, although the selected PWVs and SWVs are the same. Each SDD output series is passed to the LSTM-based forecasting framework for the next-day PV prediction and comparison of results. Figure 3 illustrates the overall flow chart of the proposed weather data mixing model-based day-ahead PV forecasting framework for small-scale PV plants.
LSTM networks are found in many applications of PV forecasting [12,14], residential load forecasting [33], natural language processing [34], and speech recognition [35]. This network has improved PV forecasting accuracy when the input PV series contains numerous repetitive daily PV generation profiles. The selection of an inadequate PV generation profile in the training series may increase the forecasting errors, and the forecasting error is unavoidable.
The common goal of all SDD output series is to predict the day-ahead PV generation profile. Because these series behave like time series and differ only in the length of the training data, the same LSTM-based training model is used for all the input PV series. The temporal dependencies (long- or short-term) between previous and current PV generation are established using several components of the LSTM structure, such as the internal memory cell, forget gate, input gate, and output gate. To learn complex patterns in the PV generation data, these LSTM components use corresponding activation functions. The LSTM training model is tracked, checked, and updated based on the sum of the mean square error (MSE) using the back-propagated gradient-based algorithm. Let $\hat{p}_D$ and $p_D$ represent the predicted and actual PV generation profiles for day $D$, respectively. The objective function that minimizes the MSE can be written as:
$$\arg\min_{\theta} \left\| \hat{p}_D - p_D \right\|_F^{2}$$
where $\|\cdot\|_F$ is the Frobenius norm, and $\theta$ represents the modeling parameters of the LSTM network. These parameters comprise the various weights and biases that are repeatedly updated during the training process.
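For reference, the gates named above interact as in the standard LSTM cell formulation below. The symbols are not taken from the paper and are assumed here: $x_t$ is the input at time step $t$, $h_t$ the hidden state, $c_t$ the memory cell, $\sigma$ the sigmoid function, $\odot$ the element-wise product, and $W$, $U$, $b$ the weights and biases that make up $\theta$.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) && \text{(input gate)} \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) && \text{(output gate)} \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) && \text{(candidate memory)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)} \\
h_t &= o_t \odot \tanh\left(c_t\right) && \text{(hidden state)}
\end{aligned}
```

The forget gate decides how much of the previous memory is kept, which is what allows the network to retain or discard information across the repetitive daily generation profiles.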

Simulation Results and Discussion
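To evaluate the forecasting performance of all the weather data collection methods, two performance evaluation metrics were used: the mean absolute percentage error (MAPE) and the root mean square error (RMSE). The day-ahead forecasting error, in terms of MAPE and RMSE, was measured using the following equations:
$$\mathrm{MAPE} = \frac{100}{T} \sum_{t=1}^{T} \frac{\left| p_{D+1,t} - \hat{p}_{D+1,t} \right|}{P_{capacity}}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( p_{D+1,t} - \hat{p}_{D+1,t} \right)^{2}}$$
where $p_{D+1,t}$ and $\hat{p}_{D+1,t}$ are the actual and predicted PV generation at time step $t$, respectively, $T$ is the number of time steps in the day, and $P_{capacity}$ is the total installed PV capacity.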
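The two metrics can be computed as in the sketch below. It assumes, consistent with the definition of the installed capacity above, that the absolute error in the MAPE is normalized by the plant capacity rather than by the actual generation (which would be undefined at night, when generation is zero). The function names and the sample profile are illustrative.

```python
import math


def mape_capacity(actual, predicted, capacity):
    """Capacity-normalized MAPE in percent over one day's time steps."""
    T = len(actual)
    abs_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return 100.0 * abs_error / (T * capacity)


def rmse(actual, predicted):
    """Root mean square error, in the same unit as the inputs."""
    T = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / T)


# Hypothetical four-step profile for a plant with a capacity of 100 units.
actual = [0.0, 50.0, 100.0, 50.0]
predicted = [0.0, 40.0, 90.0, 60.0]

print(mape_capacity(actual, predicted, capacity=100.0))
print(rmse(actual, predicted))
```

Because the MAPE is scaled by capacity, plants of different sizes can be compared on the same percentage scale, whereas the RMSE remains in the unit of the generation data.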

Data Description
To evaluate the proposed methodologies, the historical PV series were collected from four PV plants among twenty PV plants located in South Korea. The duration of the evaluation was one year, commencing from January 2018 and ending in December 2018. Weather parameters, such as average temperature, wind speed, wind direction, humidity, sunshine hours, cloud cover, atmospheric pressure, and rain amount, were used as weather variables. The average temperature, wind speed, humidity, atmospheric pressure, and rain amount are measured using a general scale of measurement. In addition, sunshine hour, cloud cover, and wind direction are measured in terms of percentile. Each observation of the weather variables is a ground observation that is performed at a fixed and same observation time. These parameters were successfully implemented in a previous study [14,25]. The Meteorological Agency's Open Weather Portal Office of South Korea performs weather observations using weather variables throughout the country and publishes via websites. Because South Korea is a country that mainly comprises mountains, several small valleys, and many narrow coastal plain regions, these tested PV plants are selected assuming that the maximum area will be covered. Table 1 shows the information about all the tested PV plants with the corresponding WDCs located at a certain distance. PV plants 1, 2, and 4 are located in remote hilly areas, whereas PV plant 3 is located in the city area. The selected plants do not have their individual meteorological data center for daily weather evaluation. The distance between PV plants and nearby WDC is limited to 50 km, which is the line-of-sight distance in mobile communication. In Table 1, only PV plant 2 had four WDCs that were accessible within a radius of 50 km. As the proposed ICW model deals with correlative information, the collected mixed weather data from this model and weather data from the four WDCs were used for correlation comparison. 
Figure 4 depicts the correlation comparison results between the ICW model and the four WDCs for April, August, October, and December. In Figure 4, the proposed ICW model shows a noteworthy improvement in the correlation measure that confirms the importance of the weather-mixing model.

Hyperparameter Tuning
The proposed weather-mixing method, the SDD method, and LSTM-based forecasting models were developed and tested using the Python programming language. The SDD and LSTM-based forecasting models use the Keras and TensorFlow library [34,36], along with the basic Python library. The proposed PV forecasting approach is simulated in a Windows operating environment using an i7-600 CPU at 3.40 GHz and 16 GB of installed RAM. Table 2 shows the optimum hyperparameter tuning for the LSTM-based forecasting model, inspired by [33,37].
The number of hidden layers, the loss function, and the optimizer are common hyperparameters in deep learning. In time series analysis, increasing the number of hidden layers beyond three does not necessarily yield a higher forecasting performance [33]. For the proposed LSTM network, two hidden layers were used, with 24 nodes in the first layer and 12 nodes in the second. These layers use nonlinear activation functions (sigmoid and tanh). The model was trained using the root mean square propagation (RMSprop) optimizer, which converges well over long- and short-term dependencies. The training process was limited to 300 iterations because the input PV series were small and homogenous.
Mean squared error (MSE) was set as the evaluation metric, and the batch size was set to 32 to obtain a proper gradient for optimum convergence. For each day-ahead forecasting operation, each input dataset was divided into a testing set (1 day), a validation set (10% of the days), and a training set (the remaining days).
Figure 5 depicts the actual and day-ahead PV forecasting outputs from the proposed models and the existing methods for PV plants 1, 2, 3, and 4 during a randomly selected week. Table 3 shows the forecasting performance in terms of the average MAPE and RMSE for the selected week.
For the selected week, the ICW model showed the best forecasting output for PV plants 3 and 4. The averaged weather data and the IDW model gave the most accurate results for PV plants 1 and 2, respectively. PV plants 2 and 3 have their closest WDC at a distance of 6.01 km and 5.7 km, respectively. In these PV plants, the average MAPE results of the different methods differed only slightly, by less than 1%. However, the nearest WDC for PV plants 1 and 4 was located at least 25 km away, and these PV plants show a variation of more than 1.5% in the MAPE results. This difference in the MAPE results may be due to the distance between the PV plant and the nearby WDCs. In PV plant 4, the proposed ICW model improves the MAPE by more than 1.41% compared with the IDW model (the second-best performer). Both the proposed IDW and ICW models showed better forecasting accuracy than the existing methods. This indicates that both the distance-based and the correlation-based models are useful for mixing weather data from all the accessible WDCs.
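The division of each input dataset into a testing day, a validation set, and a training set, described at the start of this subsection, might look like the sketch below. The text does not say which days form the validation set or whether the 10% is taken before or after removing the testing day; this sketch assumes chronologically ordered days, takes the last day as the testing day, and uses the most recent 10% of the remaining days for validation. The function name `split_days` is hypothetical.

```python
def split_days(days, val_fraction=0.10):
    """Split chronologically ordered days into (train, validation, test).

    ASSUMPTIONS: the test set is the last day, and the validation set is the
    most recent `val_fraction` of the remaining days (at least one day).
    """
    if len(days) < 3:
        raise ValueError("at least three days are needed for a three-way split")
    test = days[-1:]
    rest = days[:-1]
    n_val = max(1, round(len(rest) * val_fraction))
    validation = rest[-n_val:]
    train = rest[:-n_val]
    return train, validation, test


# Hypothetical example: 41 similar days returned by the SDD, indexed 0..40.
train, validation, test = split_days(list(range(41)))
print(len(train), validation, test)
```

With 41 days this yields 36 training days, 4 validation days, and 1 testing day. Keeping the split chronological avoids validating on days that precede the training data.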

Seasonal Evaluation
The weather dataset shows entangled activities when the seasons are changed. In [10], the weather dataset was decomposed into spring, summer, autumn, and winter to acquire a better forecasting output. However, the splitting process for weather data may cancel the inherent consistency of the weather dataset. To maintain regularity in the weather data, this study used prior 120-day windowing for each day-ahead PV prediction. To perform seasonal evaluation, the proposed method selects four arbitrary weeks from each season. Figure 6 shows box plot results of the overall MAPE computed based on the proposed models and the existing methods for PV plants 1, 2, 3, and 4. Table 4 shows seasonal average MAPE and RMSE results from the proposed models and existing methods of the PV plants 1, 2, 3, and 4. For PV plants 1 and 4, the proposed ICW model shows improvements above 0.8% and 0.3%, respectively, in terms of forecasting MAPE compared with the average weather data (the second-best performer). In the case of PV plant 3, the IDW model shows improved forecasting MAPE and RMSE results, above 0.2% and 2.58 kWh, respectively. Because PV plant 3 has one WDC located at a distance of 5.27 km, the IDW model provides improved RMSE results compared with the correlation-based ICW model.
PV plant 2 has four WDCs, and the closest WDC is situated at a distance of 6.01 km. In this PV plant, the proposed ICW model outperforms all the proposed distance-based models by more than 0.13 kW for the average RMSE results. Similarly, in terms of the MAPE results, both the ICW and IDW models are comparable in each season of the year. This highlights that the proposed ICW and IDW models have an improved weather data collection ability from the nearby WDCs.

Conclusions
In this paper, two weather data mixing models were proposed to collect suitable weather data for day-ahead PV forecasting in small-scale PV plants. These mixing models collect mixed weather data from all the accessible WDCs within a defined distance. Among the four PV plants tested, the two PV plants whose closest WDC was at least 25 km away exhibited better performance with the proposed ICW model. In these plants, the impact of the distance from the source of the weather data was significantly reduced. In addition, the proposed IDW model showed a higher PV forecasting accuracy in the other two PV plants, whose nearest WDC is located within about 6 km. In contrast, the raw weather data from the closest WDC (the conventional weather data collection method for the STPF task) did not lead to better PV forecasting accuracy in any of the tested small-scale PV plants. This highlights that the proposed models enhanced the forecasting accuracy for small-scale PV plants, even when these plants were installed far from the WDCs. In the future, a day-by-day selection procedure for the mixing model will be developed to further increase the forecasting performance for small-scale PV plants.