Machine Learning Modeling of Horizontal Photovoltaics Using Weather and Location Data

Abstract: Solar energy is a key renewable energy source; however, its intermittent nature and potential for use in distributed systems make power prediction an important aspect of grid integration. This research analyzed a variety of machine learning techniques to predict power output for horizontal solar panels using 14 months of data collected from 12 northern-hemisphere locations. We performed our data collection and analysis in the absence of irradiation data, an approach not commonly found in prior literature. Using latitude, month, hour, ambient temperature, pressure, humidity, wind speed, and cloud ceiling as independent variables, a distributed random forest regression algorithm modeled the combined dataset with an R² value of 0.94. As a comparative measure, other machine learning algorithms resulted in R² values of 0.50–0.94. Additionally, the data from each location was modeled separately with R² values ranging from 0.91 to 0.97, indicating consistent performance across all sites. Using an input variable permutation approach with the random forest algorithm, we found that the three most important variables for power prediction were ambient temperature, humidity, and cloud ceiling. The analysis showed that machine learning potentially allowed for accurate power prediction while avoiding the challenges associated with modeled irradiation data.


Introduction
Power generation from solar photovoltaics (PV) is expected to grow 30% in the next five years, and much of this growth is anticipated to be in the form of distributed solar PV systems [1]. Distributed PV can be advantageous to residential customers and commercial/government facilities, both in urban settings and in more dispersed settings (e.g., remote military installations), where there may be limitations on building large, centralized PV arrays. The challenge of intermittency for solar energy is well-established and highlights the critical function of forecasting solar PV power output, especially in a distributed environment. Solar PV power forecasting has been studied extensively. Lorenz et al. (2014) provided an overview [2], and Raza et al. (2016) discussed recent advances [3]. Often, solar power forecasting studies are based on predicting irradiance or using historical power output. Yang et al. (2015) used exponential smoothing to improve predictions of horizontal irradiance [4]; Gueymard (2008) studied irradiance forecasting for surfaces of any angle [5]. Lorenz et al. (2010) used regional weather data to forecast irradiance, which was then converted to power [6]. Various studies have considered predicting irradiance or power using weather and prior power output data [7,8]. Additionally, previous studies forecasting solar irradiance or power output are often based on data from a limited number of locations [8][9][10][11].
The novel aspect of this research is the quantification of the ability of machine learning to predict photovoltaic power output in the absence of irradiation data, using collected data from a range of climate zones. This is motivated by challenges with available irradiance data. Irradiance is conceptually a reliable predictor of solar power output, and it was found to be the most important factor in predicting solar panel power output in two photovoltaic studies that utilize modeling [12,13]. However, irradiation data can be time-consuming to measure at a specific site, and prediction of irradiation can generate forecast errors, may be unsuitable for accurate PV performance analysis, and may contain 8-25% uncertainty if modeled [7,9,14,15]. Additionally, this work studied forecasting power output for horizontal PV arrays for the following reasons:
1. Many entities do not have space available to install large solar arrays; thus, horizontal, distributed arrays, such as building rooftops, can broaden the opportunities to implement solar energy.
2. Many models have been developed for latitude-tilted applications [16]. While latitude-tilted solar panels possess the ability to capture more direct solar irradiation, horizontal solar panels have been found to perform better under diffuse irradiation conditions [17][18][19][20].
Accordingly, we performed our data collection and analysis of horizontal photovoltaics in the absence of irradiation data. This tested the hypothesis that accurate power prediction can result from the combination of advances in machine learning and avoided irradiation uncertainty. The objective of the work was to quantify the ability of this approach.
The approach used in this work was based on the following selection of input variables and the type of photovoltaic panel. There are several factors identified in prior research that affect both the irradiation that reaches the panel and the panel's ability to convert the irradiation to usable energy:

• Cloud Ceiling: the presence of clouds above a panel will scatter solar irradiance and decrease the amount of irradiation a panel receives; the cloud ceiling is measured at the altitude where at least 5/8ths of the sky above the weather station is covered by clouds [17][18][19][20][21][22][23][24][25];
• Latitude: the latitude of each location will dictate the sun deflection angle; this will affect the amount of sunlight the panel receives [12,[21][22][23]25,26];
• Month: when the sun rises and sets and how high it will appear in the sky at any location on the earth is determined (in part) by the time of year at that location [13,21];
• Hour: the time of day determines how high the sun is in the sky, or whether or not it is present at all. Hour controls for the sun's position in relation to the time of day [21];
• Humidity: water affects incoming sunlight through refraction, diffraction, and reflection. Indirectly, humidity also affects dust build-up on panels due to the formation of dew increasing coagulation of dust [27]; conversely, dew formation on the surface of a panel may increase performance when compared to a humid air condition [28];
• Temperature: the efficiency of a solar panel will generally decrease with an increase in panel temperature [29,30]. Including temperature as an explanatory variable for power output has led to increased predictability [12,13,[31][32][33];
• Wind speed: the temperature of the panel may be affected by the speed of the wind surrounding the panel [34,35]. Increased wind speed can also clean the dust off of the panel surface or stir up dust, thereby affecting the irradiance that reaches the panel [36];
• Visibility: this variable is a measurement of the distance at which a light can be seen and identified [37]. Visibility will primarily affect how much irradiation reaches the panel and can have a negative effect on power output if visibility is low during daylight hours;
• Pressure: pressure may have an effect on power output predictability by indicating a weather occurrence, such as a storm [38]; this variable has not been extensively explored in solar panel power output literature;
• Altitude: there is less atmosphere for the sun to travel through at locations with higher altitudes; this results in a higher level of irradiation at locations farther above sea level.
Monocrystalline and polycrystalline silicon PV panels comprise nearly 90% of the world's photovoltaics and achieve efficiencies of 15-25% and 13-16%, respectively [39]. Polycrystalline panels were selected for this analysis as they are more widely installed than monocrystalline photovoltaics and have a lower cost, making them well-suited for distributed PV settings.
Prior researchers have predicted photovoltaic power output or efficiency utilizing multiple input factors, such as irradiation, temperature, humidity, solar elevation angle, wind speed, wind direction, month, and others [37][38][39][40]. Table 1 summarizes the key characteristics from four photovoltaic studies; the present work was also included for comparison. The table highlights that numerous variables have been studied for use in photovoltaic modeling over various timeframes, depending on the research objectives. In Table 1, short is defined as having an effect within a day, medium is on the order of months, and long is an effect that takes a year or more to impact the power output. Busquet et al. (2018) primarily studied the medium- and long-term effects of factors, such as aging and soiling; panel age is not commonly used by other studies [35]. Aging describes the amount of time the panel has been installed and exposed to the elements, and soiling describes the dust build-up on the panel's surface. Kayri et al. (2017) and Lahouar et al. (2017) forecasted solar panel power output and used short-term factors, such as solar elevation angle and wind direction. However, they did not include longer-term factors, such as aging [12,13]. Mekhilef et al. (2012) conducted a medium-timeframe review primarily interested in the effects of dust, humidity, and air velocity, including the contribution of water droplets trapped inside the cell and dew-induced dust accumulation [27]. Solar irradiance is one common factor that the four studies used. The present work differentiated itself from prior work by predicting horizontal solar panel power output using only readily available data (such as position, time, and weather) while not including irradiation. If the power output of a solar panel can be reasonably predicted without including irradiation as an input, then it becomes easier to assess the cost-effectiveness of a PV array at any global location.

Materials and Methods
This section presents the procedures and processes used in this study. A description of the test equipment used to gather the data, how the data was processed for predictive modeling, model development, and validation methods are provided.

Materials and Equipment
The test systems used in the study were designed and manufactured as part of a previous research effort and were distributed to global United States Air Force (USAF) installations [40]. The test systems were comprised of the following equipment: The Raspberry Pi computer system was used to record the following information at 15-min time intervals: panel power output, temperature, humidity, date, and time. The SD card in the computer was retrieved by the site monitors and downloaded every month, and the dataset was sent to the researchers. Site monitors at each location were given instruction to clean off the panel whenever dust or snow cover was observed. Although this was performed daily for some locations, others were cleaned less frequently. The unknown frequency of panel cleaning at some locations was a known limitation of this research.

Data Description
Data collected from 12 locations were utilized within this study; the data is available for further analysis [41]. The collection locations were selected from a larger dataset of all Department of Defense (DoD) installations located within 25 regions [40]. Using this dataset, along with a recognized climate classification matrix, a Pareto analysis was performed to determine the locations of test sites within climate regions [40]. While reviewing the collected data, the project team discovered that only a subset of locations collected reliable data. After post-processing the data, the team chose to limit collection data to the northern hemisphere. This decision was motivated by seasonal differences between hemispheres and by the selection of collection sites in close proximity to National Oceanic and Atmospheric Administration (NOAA) weather stations.
The test systems at each location provided the ambient temperature, relative humidity, timestamp, and power output for each panel. Altitude, latitude, and four weather variables from the NOAA were also added to the dataset. The weather stations that recorded the NOAA wind speed, cloud ceiling, visibility, and atmospheric pressure data were located at airports no more than five miles from each test system [42]. The cloud ceiling data measured the lowest cloud layer with 5/8ths or greater opacity, and a value of 22 km indicated a lack of cloud cover.
A graphical depiction of the 12 locations is provided in Figure 1; there were two sites in Colorado that appeared as a single red dot due to their proximity. Additionally, Table 2 provides the latitude, longitude, and Köppen-Geiger climate region of each location. All latitudes were north, and all longitudes were west. Seven different climate regions were represented in the dataset, indicating a diverse range of locations.
Energies 2020, 13, x FOR PEER REVIEW 6 of 14
Descriptive statistics for each numeric variable are shown in Table 3; hour and month were not listed as they were treated as categorical variables in the model.

Data Pre-Processing
The initial dataset was filtered to only include the time window of 10:00-15:45 to avoid modeling periods of darkness and reduced sunlight. This restriction also helped mitigate possible obstructions from both natural and man-made objects when the sun was low in the sky. Next, the pairwise correlation coefficients for all numeric variables across all sites were calculated; the results are presented in Figure 2. Only one pair of variables showed a high correlation coefficient: altitude and pressure. Altitude was subsequently removed since its value was fixed for each location, whereas pressure, like power output, showed some degree of variation at each location.
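The two pre-processing steps described above (the midday time filter and the pairwise correlation check) can be illustrated with a minimal sketch. The readings, the function names `in_midday_window` and `pearson`, and the tuple layout are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
from datetime import time
from math import sqrt

def in_midday_window(t, start=time(10, 0), end=time(15, 45)):
    """Return True when a reading's time of day falls inside the window (inclusive)."""
    return start <= t <= end

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical readings: (time of day, altitude in m, pressure in hPa).
readings = [
    (time(9, 45), 1.0, 1013.0),
    (time(10, 0), 1.0, 1012.0),
    (time(12, 30), 1800.0, 815.0),
    (time(15, 45), 1800.0, 818.0),
    (time(16, 0), 600.0, 945.0),
    (time(13, 15), 600.0, 947.0),
]

# Step 1: keep only readings inside the 10:00-15:45 window.
kept = [r for r in readings if in_midday_window(r[0])]

# Step 2: correlation between two numeric columns of the filtered data.
altitude = [r[1] for r in kept]
pressure = [r[2] for r in kept]
r_alt_pres = pearson(altitude, pressure)
```

In this toy data, altitude and pressure are strongly negatively correlated, which mirrors the kind of redundancy that would motivate dropping one of the two variables.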

Machine Learning Modeling
H2O.ai is an open-source machine learning tool used in this study to compare various modeling algorithms to determine the best fit for power output; H2O.ai includes a tool called AutoML that automates the machine learning model building process through a graphical user interface [43,44]. For this research, algorithm accuracy was assessed using the entire dataset and using a cross-validation process. The cross-validation process divided the dataset into k bins; during each iteration of the model building process for a given algorithm, one bin was the validation set, and the other k-1 bins were the training set. Thus, k cross-validation models were built for each algorithm. For reproducibility, the number of folds was set to k = 5, the maximum runtime was limited to 8000 s, and other H2O.ai input parameters were set by the software to the default values.
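The k-bin cross-validation scheme described above can be sketched in a few lines. This is a generic illustration that assumes a random assignment of rows to bins; the function name `k_fold_splits` and the seed are hypothetical, and the sketch does not reproduce how H2O.ai assigns folds internally.

```python
import random

def k_fold_splits(n_rows, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Bin i takes every k-th shuffled index, so bin sizes differ by at most one.
    bins = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = bins[i]
        # The other k-1 bins together form the training set.
        train = [idx for j, b in enumerate(bins) if j != i for idx in b]
        yield train, validation

# Example: 23 rows split into k = 5 folds.
splits = list(k_fold_splits(n_rows=23, k=5))
```

Each row appears in exactly one validation set, so every observation is used once to test a model that was not trained on it.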
Six algorithms were compared in this research. The first five are the popular "base learner" algorithms [43]. The sixth algorithm (stacked ensemble build) is referred to as a "metalearner"; it creates an additional model, which is a combination of models from the other five algorithms. Descriptions of the six machine learning algorithms are provided below [45]:

• Deep learning is designed using the "multi-layer feedforward artificial neural network that is trained with stochastic gradient descent using back-propagation." This method provides understanding into network behavior based on altering the weights and biases;
• Distributed random forest (DRF) randomly selects a subset of the features and generates a single forest of regression or classification trees based on those features; this process is repeated, based on the number of trees specified, with a random subset on each iteration. The predictions are based on the average prediction of all of the trees in the forest;
• Distributed random forest extremely randomized trees (XRT) selects thresholds differently when compared to the distributed random forest model. Thresholds from a random subset of features are chosen at random and ranked by the best threshold.

Impact of Input Variables
In the absence of irradiance data, understanding the importance of the input variables used to predict horizontal solar panel power output was important. Variable importance was determined by measuring how much each variable decreased the model mean squared error (MSE), defined as:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (1)$$

In Equation (1), $n$ is the number of validation data points, $y_i$ is the actual response, and $\hat{y}_i$ is the predicted response. The MSE was calculated again after permuting each predictor variable, and the MSE of the unpermuted validation dataset was then subtracted. The average change in MSE for each predictor variable permutation was then determined. This value was then scaled by dividing the MSE reduction by the variable's standard error.
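The permutation procedure for variable importance can be sketched generically as follows. This is a minimal illustration under stated assumptions: `toy_predict` is a hypothetical stand-in for a fitted model, the data are synthetic, and the final scaling by the standard error is omitted. It is not the H2O.ai implementation.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error, as in Equation (1)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, rows, y_true, n_repeats=10, seed=0):
    """Average increase in MSE when each input column is shuffled.

    predict: callable mapping a list of feature rows to a list of predictions.
    rows: list of feature rows (each a list of numbers).
    """
    rng = random.Random(seed)
    baseline = mse(y_true, predict(rows))
    importances = []
    for col in range(len(rows[0])):
        deltas = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving the other columns untouched.
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [v] + row[col + 1:]
                        for row, v in zip(rows, shuffled)]
            deltas.append(mse(y_true, predict(permuted)) - baseline)
        importances.append(sum(deltas) / n_repeats)
    return importances

# Hypothetical stand-in for a fitted model: depends only on the first feature.
def toy_predict(rows):
    return [3.0 * row[0] for row in rows]

rows = [[float(i), float((7 * i) % 5)] for i in range(20)]
y_true = toy_predict(rows)
scores = permutation_importance(toy_predict, rows, y_true)
```

Because the toy model ignores the second feature, shuffling it leaves the MSE unchanged, whereas shuffling the first feature increases the error; the larger the increase, the more the model relies on that variable.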

Methodology Summary
Figure 3 below provides a flowchart of the analysis used for this study. While steps 2 and 3 were specified for the DRF algorithm, the general flow would apply to each algorithm assessed.

Results
The R², mean absolute error (MAE), and root MSE (RMSE) training data results for each algorithm are presented in Table 4; the DRF algorithm was the most accurate in modeling power output for the full dataset. The primary methods of assessing the accuracy of the results were the R², MAE, and RMSE values presented in Table 4. During the training process, additional insight into model performance could be gained from the results of the cross-validation process, which are presented in the right column of Table 4 for the six algorithms. Cross-validation allowed for an efficient way to test the predictive capability of an algorithm on data not included in training the model. Based on the cross-validation results (using five folds), the stacked ensemble build and gradient boosting machine methods performed slightly better than the DRF method (a 2.1% and 1.2% increase in R², respectively). Based on the results in Table 4 and the commonality of the distributed random forest algorithm (DRF) with our comparison studies, we conducted further analysis on the DRF model.
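For reference, the three accuracy measures used above can be computed as in the following minimal sketch. The standard definitions are assumed, and the values in `actual` and `predicted` are hypothetical numbers chosen for illustration, not results from the study.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE, and RMSE for paired actual and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        "rmse": sqrt(ss_res / n),
    }

# Hypothetical power readings and model predictions.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 16.0]
metrics = regression_metrics(actual, predicted)
```

MAE and RMSE are expressed in the units of the power output, whereas R² is unitless, which is why R² is the more convenient measure when comparing against studies with differently sized systems.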
Random forest regression is an ensemble method that aggregates a series of individual regression trees in order to reduce model variance. The random forest model consisted of a number of decision trees and a separate number of decision variables for each tree. Using the method described in Section 2.5, the variable importance rankings across all locations are presented in Table 5. In the modeling process, multiple values for the number of decision trees were explored; the default value was 50 trees. For comparison, the rankings for 500 trees were also presented, and the rank order did not change. The three most important variables were ambient temperature, humidity, and cloud ceiling. The data from each location was then modeled separately using the DRF algorithm. Table 6 presents the results and shows there is location-dependent variation between ambient temperature, cloud ceiling, and humidity as the main drivers of model performance. This was expected as the locations of the test sites vary across eight climate regions where solar energy potential is affected by accompanying variations of temperature, humidity, and cloud cover [48]. Ambient temperature and humidity were the top two influencers of solar power output in nine of the 12 locations. The results in Table 6 were relatively consistent across locations, which indicated that the independent variables provided sufficient information to be applied across a range of geographical locations. An important variable for predicting power would seemingly be latitude; however, it was ranked seventh in both the 50-tree and 500-tree models. The relative unimportance of latitude might be due to the limited latitude range included in the model. Latitudes in the northern hemisphere range from 0 to 90 degrees; however, the latitude range for the dataset was only 21-48 degrees.
As shown in Table 6, the DRF algorithm best predicted the Travis data, whose location is 38.16 degrees latitude and 121.56 degrees longitude within the hot-dry climate region. Camp Murray in Washington had the second-best model performance; this site is located at 47.11 degrees latitude and 122.57 degrees longitude in the mixed-humid climate region. Between these two sites, the higher percentage of humidity and ambient temperature influence in Camp Murray was likely due to larger seasonal variations in these variables. In contrast, the model performance for the Kahului, Hawaii, site was the poorest. The seasonal weather variation there was substantially different from the remainder of the sites. A final observation from Table 6 was the difference in model performance between USAFA and Peterson. While the sites are only 20 miles apart, they have significantly different geographical characteristics, as USAFA is nestled on the Rocky Mountain foothills and Peterson is located on the plains. In such a scenario, predicting output in the absence of irradiation data may be beneficial as irradiation may not vary significantly between locations.

Discussion
To the best knowledge of the authors, this work provided the first study to predict the power output of geographically distributed horizontal polycrystalline solar panels in the absence of irradiation or previous power output data. Although it can be challenging to make exact comparisons with previous research due to the range of potential differences, it is still insightful to see how these results compare to other studies. There has been modeling done for a range of algorithms and datasets to forecast solar PV energy output using solar irradiance. Ahmad et al. (2018) predicted hourly energy output and reported training set R² values of 0.9105, 0.9272, and 0.9367 for support vector machine, extremely randomized trees, and random forest models, respectively [49]. Ramsami and Oree (2015) used single-stage and stepwise regression and neural network models with correlation coefficients ranging from 0.914 to 0.937 [50]. Additionally, Pedro and Coimbra (2012) used previous power output data in time series, neural network, and nearest neighbor models to forecast one-hour ahead energy output with R² values ranging from 0.91 to 0.96 for the full validation set [51].
We also presented our results in the context of the three quantitative studies summarized in Table 1. Table 7 displays the present results next to the most applicable subset of results from the three other studies. It is important to emphasize that the purpose of these comparisons was to understand the context of forecasting solar power output in the absence of irradiation data. The results presented were chosen to make the comparisons as close as possible, i.e., most similar algorithms and type of solar panel, but there were still differences in the tuning parameters, the definition of power for the dependent variable, the available independent variables, and the time period of the data. In general, the results of this study indicated that solar power prediction might be suitable in the absence of irradiation data, as the quantitative performance measures were comparable to those of the other studies. One notable difference was that this study included nine independent variables, whereas the three comparison studies in Table 7 used six. These additional parameters might have sufficiently compensated for the lack of irradiation data, which was consistently shown to be the most important variable in the other studies. Lahouar et al. (2017) conducted an additional analysis excluding irradiation data, with a resulting MAE = 44,271 W and RMSE = 59,391 W [13]; these measures were significantly higher than the DRF results in this study. These differences might be due to the larger power systems, short timeframe (a single week in January), the exclusion of other independent variables, or a smaller data set.

Conclusions
In summary, using only weather, time, and geographic variables, 14 months of data from 12 northern-hemisphere locations were modeled using a variety of machine learning techniques. These data yielded a model accuracy of R² = 0.94 using the distributed random forest algorithm on the full dataset within the H2O.ai platform. This work indicated that advances in machine learning could potentially facilitate accurate prediction of the power output of horizontal photovoltaic panels without irradiation data; this type of prediction was beneficial as irradiation data could be time-consuming to measure or contain significant uncertainty if modeled. Additionally, we identified the three most important weather variables for power prediction in the absence of irradiation data as ambient temperature, humidity, and cloud ceiling.
This type of analysis could be practically useful in supporting feasibility and cost-effectiveness decisions for the use of solar power for geographically distributed entities, especially ones of an agile or expeditionary nature. For example, consider the power requirements of a small, quick response team (such as those responding to a humanitarian crisis or natural disaster) that operates over a diverse range of locations. These teams often work in austere environments without a local, reliable power source. The ability to determine the feasibility and scale of distributed PV power in support of these teams without requiring the time to gather or model irradiation data could be valuable. This benefit could also extend to the growth of distributed residential PV in rural areas.
In addition to the applicability of this forecasting, the scalability of this particular study had both advantages and disadvantages. On the advantage side: (1) the study was conducted over a relatively diverse set of locations (as noted in Section 3); (2) the data was collected in a controlled manner, e.g., there was specific installation and operation guidance provided to each site; (3) by each measure of accuracy, there were multiple machine learning algorithms that gave a similar performance, indicating some degree of robustness to the choice of algorithm. Conversely, the collection of the weather data was not pre-planned as part of gathering the solar panel data, nor were the comparison studies identified prior to executing the machine learning algorithms.
Therefore, future research could extend the benefit and efficacy of this type of forecasting. An experiment could be conducted whereby a distributed solar PV system is sized based on a nominal requirement and the forecasted power output using this model, and then how well the system met the power requirements could be measured. Additionally, the collection of weather data could be automated or linked directly with the location of the PV system (as opposed to the local weather station). Finally, further comparisons of these results with other models could be studied.