A Log-Logistic Predictor for Power Generation in Photovoltaic Systems

Abstract: Photovoltaic (PV) systems depend on solar irradiation and environmental temperature to achieve their best performance. One of the challenges in the photovoltaic industry is performing maintenance as soon as a system is not working at its full generation capacity. The lack of a proper maintenance schedule affects power generation performance and can also decrease the lifetime of photovoltaic modules. Regarding the impact of environmental variables on the performance of PV systems, research has shown that soiling is the third most common reason for power loss in photovoltaic power plants, after solar irradiance and environmental temperature. This paper proposes a new statistical predictor for forecasting PV power generation by measuring environmental variables and the estimated mass of particles (soiling) on the PV system. Our proposal is based on the fit of a nonlinear mixed-effects model, according to a log-logistic function. Two advantages of this approach are that it assumes a nonlinear relationship between the generated power and the environmental conditions (solar irradiance and the presence of suspended particulates) and that random errors may be correlated, since the power generation measurements are recorded longitudinally. We evaluated the model using a dataset comprising environmental variables and power samples that were collected from October 2019 to April 2020 in a PV power plant in mid-west Brazil. The fitted model presented a maximum mean squared error (MSE) of 0.0032 and a linear correlation coefficient between the predicted and observed values of 0.9997. The estimated average daily loss due to soiling was 1.4%.


Impact of Soiling on Power Generation in Photovoltaic Systems
Photovoltaic (PV) systems have a high penetration capacity with a low environmental impact [1]. Some countries are more suitable as sites for PV plants due to their high solar irradiation. Brazil, for example, has vast solar energy potential, mainly in the northeast region, which offers the highest average solar radiation and the lowest annual variability [2]. International agencies have predicted that the use of PV energy for worldwide power generation will increase from 2% in 2018 to 25% in 2050 [3].
One of the challenges in the photovoltaic industry is performing maintenance as soon as a system is not working at its full generation capacity [4]. The lack of a proper maintenance schedule affects power generation performance and can also decrease the lifetime of photovoltaic modules.
One source of power generation loss is soiling on PV modules. Soiling may attenuate solar irradiation on the PV cells, thus reducing their photovoltaic effects and energy generation capabilities. The uniform dispersion of soiling affects spectral transmittance, which is a characteristic of photovoltaic technology that is used to construct the modules [5]. Suspended particles in the air are one of the primary sources of soiling. The average annual losses in the performance of solar plants due to soiling range from 3% to 6%, while monthly evaluations indicate power losses from 14% to 20% [6,7]. Correctly understanding soiling deposition patterns is crucial to solving this problem.
Soiling particles in the air change according to the environment, leading to different impacts depending on the location of the photovoltaic system [8]. Regarding the endemic features of soiling, the monitoring of environmental parameters and suspended particles in the air should be local (within the PV power plant site) and continuous. Understanding the economic implications of soiling on photovoltaic systems is essential to increase energy production, optimize cleaning routines and minimize the associated costs [9]. The concern around the impacts of soiling on PV energy generation is primarily due to the increase in the number of large-scale PV power plants in isolated regions that are exposed to high soiling levels. The online monitoring of PV systems can benefit decision-making regarding the maintenance and cleaning of PV modules to reduce the impacts of soiling on power generation [10].

Related Work
Many studies have proposed power loss estimates from soiling on PV power plants [11]. The proposed techniques range from applying images (from cameras and satellites) and computer vision algorithms to adopting environmental sensors to identify and estimate the impacts of soiling on power generation.
Mehta et al. [12] determined the impacts of soiling by designing models that were based on convolutional neural networks (CNNs), the RGB images of solar panels and environmental variables. The authors presented the impacts of soiling on power loss and achieved an accuracy of up to 97.82%. Another soiling identification technique was based on finding "hot spots" on photovoltaic modules [13,14]. This technique was carried out using radiometric sensors that showed a high degree of reliability compared to thermographic cameras and achieved an accuracy of 99.02% and a precision of 91.67%.
Pavan et al. [15] compared two techniques to determine the effects of soiling on large-scale photovoltaic plants. The experiment was carried out at two solar plants in different sites in Italy. The authors developed prediction models using a Bayesian neural network (BNN) and a polynomial regression model (PRM). PRMs have also been used in previous research on the same systems [16]. The power loss due to dust particle accumulation on the PV systems ranged from 1% to 5%.
The cleanliness index (CI) [17] has also been applied to evaluate the performance loss caused by soiling. Its authors used an artificial neural network (ANN) and multilinear regression (MLR) to estimate the CI. As reported by the authors, the ANN model achieved better results than the MLR model (ANN: $R^2 = 0.54$; MLR: $R^2 = 0.17$). The authors also observed that wind speed and relative air humidity were the environmental variables with the greatest effects on the amount of dust that was transported and accumulated on the PV modules.
Hammad et al. [18] studied the impacts of soiling on PV power plants in the Middle East and North Africa (MENA) region. The authors highlighted the usage of MLR and ANN models to estimate the impacts of soiling on power loss and the optimal moment to clean the modules in the PV systems. The models considered dust exposure time and environment temperature as the independent variables and the PV system conversion efficiency as the dependent variable. Once all of the losses due to inverters, array mismatch, operating environment temperature and dust accumulation were taken into account, the authors could eliminate the dust effect. The difference between the conversion efficiency before and after that elimination then corresponded to the soiling impact [19].

Statistical Methodology for Soiling Estimation
The power generation predictor that is proposed in this work was based on the fit of a nonlinear mixed-effects (NLME) model, according to a log-logistic function. The primary motivation for considering a log-logistic model was that log growth models can adequately model the accumulated measurements of generated power over the course of a day. Random effects were introduced to model the correlations among the measurements of generated power since they were recorded longitudinally. In addition, we considered irradiance and the accumulated mass (soiling) on the photovoltaic modules as the explanatory (independent) variables.
In order to estimate the model parameters, we adopted the maximum likelihood method. The estimates were obtained numerically using the algorithm that was proposed by Lindstrom and Bates [20]. We then illustrated the performance of the proposed model using a dataset comprising power generation and environmental (environment temperature, irradiance and the mass of the particles on the modules) samples that were collected from a photovoltaic plant located at the Federal University of Mato Grosso do Sul (lat.: −20.510867; long.: −54.619882). The model validation and evaluation were performed by comparing the mean squared errors (MSEs) of the predicted (from the NLME model) and the observed (from the PV power plant) generated power values.

Materials and Methods
The experimental data were collected from a PV power plant on the campus of the Federal University of Mato Grosso do Sul. The environmental and soiling monitoring system that was available in the PV plant was based on an ESP32 electronic platform (NodeMCU ESP32). The sensors that were used to collect the environmental data were a pyranometer (Hukseflux SR05-DA2) for measuring radiation at the site of the photovoltaic system, a temperature sensor (DS18B20) and a sensor for detecting suspended atmospheric particulates (Sensirion SPS30), all of which collected data every 1 min. The electronic platform collected, gathered and controlled the environmental and soiling measurements from the sensors and the generated power data from the inverter. All of the data were then sent to an AWS cloud server (a wireless internet connection was available at the PV plant site).

Estimation of Accumulated Particle Mass
According to El-Shobokshy and Hussein [21], Hupa et al. [22] and Javed et al. [17], the estimation of the particle mass on PV modules relies on several parameters, such as chemical composition, the properties of the module surface, wind speed and air particle suspension. Coello and Boyle [23] proposed estimating the particle mass deposition on PV modules using the following equation:

$$M = \left(v_{10-2.5}\, C_{10-2.5} + v_{2.5}\, C_{2.5}\right) t \cos(\theta), \tag{1}$$

where:
• $M$ is the accumulated mass in the time interval (g/m$^2$);
• $t$ is the time interval in seconds;
• $v$ is the particulate deposition speed (m/s);
• $C$ is the particulate concentration in the environment (g/m$^3$);
• $\theta$ is the PV module angle;
• the subscript $10-2.5$ denotes particles with diameters from 10 µm to 2.5 µm;
• the subscript $2.5$ denotes particles with diameters of less than 2.5 µm.
At the PV plant location in this work, the particle sensor did not detect particles above 2.5 µm, so Equation (1) was adjusted for particulates below 2.5 µm. Other adjustments to the original model were also necessary, such as the sensor height, module tilt and constant values. Figure 1 depicts the typical behavior of the particles captured over the course of a day by the Sensirion SPS30 sensor at the PV power plant site. PM1.0 and PM2.5 represent the particulate material identified in the environment with diameters up to 1.0 and 2.5 µm, respectively.

Figure 3 presents the weekly particle concentration averages. From 23 December to 3 January, the particle sensor was not functional, so no data were collected during this short period.
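The adjusted mass estimate described above can be illustrated with a minimal Python sketch. It assumes that, once restricted to particles below 2.5 µm, Equation (1) reduces to the product of a deposition velocity, the PM2.5 concentration, the interval length and the cosine of the module tilt. The function name, the constant deposition velocity and all numeric values are hypothetical and are not taken from the study; the further adjustments for sensor height and constants mentioned above are not modeled.

```python
import math


def accumulated_mass(concentrations, deposition_velocity, interval_s, tilt_deg):
    """Running accumulated particle mass (g/m^2) over successive intervals.

    concentrations: PM2.5 concentrations in g/m^3, one value per interval
    deposition_velocity: assumed constant deposition velocity in m/s
    interval_s: length of each sampling interval in seconds
    tilt_deg: PV module tilt angle in degrees
    """
    cos_tilt = math.cos(math.radians(tilt_deg))
    total = 0.0
    series = []
    for concentration in concentrations:
        # Mass deposited in this interval: v * C * t * cos(theta)
        total += deposition_velocity * concentration * interval_s * cos_tilt
        series.append(total)
    return series


# Hypothetical example: three 10-minute intervals, 20 degree tilt.
example = accumulated_mass([8e-6, 9e-6, 7e-6], 0.001, 600.0, 20.0)
```

The running total is what would serve as the accumulated-mass explanatory variable; a real deployment would replace the constant velocity with a site-calibrated value.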

Dataset and Statistical Modeling
We let D be the dataset comprising the measurements of the PV-generated power and the environmental variables (temperature, irradiance and the accumulated mass of particles) from a period of d = 105 days.
To build this dataset, we adopted the following procedure. For each day, the environmental and generated power data were registered every $r = 10$ minutes between the first and last samples of solar irradiance (the start and the end of each day). On the $j$-th day, there were $n_j$ measurements of PV-generated power, temperature, irradiance and particle mass, for $n_j \in \mathbb{Z}^+$ and $j = 1, \ldots, d$, where $\mathbb{Z}^+$ represents the set of positive integers.

Figure 4 shows the generated power, irradiance, temperature and accumulated mass measurements that were recorded on days 1 and 2. The generated power, temperature and irradiance are represented in the log scale to make the plot easier to read. The values for the accumulated mass are in the original scale. On day 1, the first measurement was recorded at 06:12 am and 74 measurements (x axis) of each variable were taken in total. On day 2, the first measurement was recorded at 06:14 am and 72 measurements of each variable were taken. The behavior of the PV-generated power and irradiance was very similar. The values of the estimated soiling (from the accumulated mass of particles) were near zero on the first two days because the experiment began just after a clean-up procedure was carried out on the PV modules.

We split dataset $D$, which had data samples from 105 days, into two sub-datasets: $D_0$ and $D_1$. Dataset $D_0$ was the training set, which was composed of measurements from the first 21 days and was used to fit the prediction model. Dataset $D_1$ was the testing set, which had measurements from the last 84 days and was used to verify the performance of the fitted prediction model.

Table 1 shows a summary of the main descriptive statistics of the training set $D_0$. The generated photovoltaic power ranged from 2.77 W to 8475.12 W, with an average of 3858.56 W. The irradiance values ranged from 0.10 W/m$^2$ to 1197.20 W/m$^2$, with an average of 470.50 W/m$^2$. The environment temperature ranged from 19.15 °C to 39.34 °C, with an average of 30.80 °C. The maximum accumulated mass of particles during the period was 0.3920 µg/m$^3$.

The statistical model assumed that no linear relationship existed among the explanatory variables. We therefore computed Pearson's correlation for each pair of variables from $D_0$ (Table 2). Irradiance and temperature had a strong positive linear relationship, whereas irradiance was only weakly correlated with the accumulated mass, and temperature and accumulated mass were moderately correlated. Because of the linear correlation between irradiance and temperature, temperature was excluded from the model fit.

We let $W_{tj}$ be the recorded power that was generated at the $t$-th instant in time on the $j$-th day, for $t = 1, \ldots, n_j$ and $j = 1, \ldots, d$. Analogously, we let $I_{tj}$ be the irradiance that was recorded at the $t$-th instant in time on the $j$-th day. The accumulated mass of particulates $M_{tj}$ on the PV system was calculated according to Equation (1).
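The screening of explanatory variables by pairwise Pearson correlation, as described above, can be sketched in a few lines of Python. The helper names and the 0.8 cut-off for a "strong" correlation are illustrative assumptions; the paper does not state a numeric threshold.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("samples must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    columns: dict mapping a variable name to its list of measurements.
    threshold: hypothetical cut-off for a 'strong' linear relationship.
    """
    pairs = []
    for name_a, name_b in combinations(list(columns), 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs
```

For a pair flagged this way, one of the two variables would be dropped before fitting, which mirrors the decision to exclude temperature in favor of irradiance.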
As illustrated in Figure 4, the $W_{tj}$ values had high variability and unstable nonlinear behavior, which complicates a linear modeling procedure. We therefore developed a modeling procedure based on the accumulated generated power, since those values behave more stably. Figure 5 shows the solar power that was generated over the course of a day, in both the original and log scales.

For simplicity, we dropped the index $j$ and used $W_t$ to denote the solar power that was generated at the $t$-th instant in time on a given day. Similarly, we used $I_t$ and $M_t$ to represent the irradiance and the accumulated mass of particulates that were recorded at the $t$-th instant on that day.

Nonlinear Mixed-Effects Model
We let $X_t$ be the accumulated generated power up to the $t$-th instant in time on a given day:

$$X_1 = W_1, \qquad X_t = X_{t-1} + W_t,$$

for $t = 2, \ldots, n_j$. Similarly, we let $I_t$ be the recorded irradiance at the $t$-th instant in time on that day.
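As a rough illustration of how the response variable can be prepared for one day, the sketch below assumes that $X_t$ is the running sum of the power readings $W_t$ and that the model is fitted to its logarithm. The function name is hypothetical.

```python
import math
from itertools import accumulate


def accumulated_log_power(power_w):
    """Return (X, Y) for one day: running sum of power and its natural log.

    power_w: power readings W_1..W_n for the day, in watts.
    """
    x = list(accumulate(power_w))
    if any(value <= 0.0 for value in x):
        # The log transform requires strictly positive accumulated power.
        raise ValueError("accumulated power must be positive")
    y = [math.log(value) for value in x]
    return x, y
```

Because the running sum is non-decreasing whenever the readings are non-negative, the resulting curve is much smoother than the raw readings, which is what makes a growth-curve model plausible here.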
We assumed that the $X_t$ values were generated according to the following model:

$$X_t = h(t \mid \theta)\, \nu_t,$$

where $h(t \mid \theta)$ is a nonlinear growth model, $\theta$ represents the parameters of the model and $\nu_t$ is a random error. Since $X_t$ is positive, we assumed that the random error follows a log-normal distribution with parameters $\mu$ and $\sigma^2$, i.e., $\nu_t \sim \mathrm{LN}(\mu, \sigma^2)$ for $t = 1, \ldots, n_j$. We set $h(t \mid \theta)$ as the logistic growth function [24]:

$$h(t \mid \theta) = \frac{\alpha}{1 + \beta_0 \exp(-\gamma t)},$$

where $\theta = (\alpha, \beta_0, \gamma)$ represents the model parameters for $t = 1, \ldots, n_j$. This model has an S-shaped curve with two distinct phases. In the first phase the growth rate increases, and in the second phase the growth rate decreases. The point at which the growth rate changes from increasing to decreasing is the inflection point, with coordinates $\left(\log(\beta_0)/\gamma,\ \alpha/2\right)$.

By applying a logarithmic transformation to both sides of the multiplicative model for $X_t$, the log-logistic model is given by:

$$Y_t = \log(X_t) = \alpha_0 - \log\left(1 + \beta_0 \exp(-\gamma t)\right) + \varepsilon_t, \tag{5}$$

where $\alpha_0 = \log(\alpha)$ and $\varepsilon_t = \log(\nu_t)$ for $t = 1, \ldots, n_j$. Thus, $\varepsilon_t \sim N(\mu, \sigma^2)$ for $t = 1, \ldots, n_j$. Hereinafter, we set $\mu = 0$. Likewise, we let $Z_t = \log(I_t)$ for $t = 1, \ldots, n_j$.

To relate the response variable $Y$ to the explanatory variables $Z$ and $M$, we considered the following hierarchical model:

$$Y_t = A_t - \log\left(1 + B_t \exp(-\gamma t)\right) + \varepsilon_t, \qquad A_t = \alpha_0 + \alpha_1 Z_t + \alpha_2 M_t + a_t, \qquad B_t = \beta_0 + \beta_1 Z_t + \beta_2 M_t + b_t, \tag{6}$$

in which the vector of random effects $(a_t, b_t)$ is bivariate normally distributed with zero mean and variance-covariance matrix:

$$\Sigma = \begin{pmatrix} \sigma_a^2 & \sigma_{ab} \\ \sigma_{ab} & \sigma_b^2 \end{pmatrix}.$$

Since the measurements were registered longitudinally, the errors could be heterogeneous and correlated. To model both structures of the dataset, we assumed that the vector of random errors $\varepsilon_j = (\varepsilon_1, \ldots, \varepsilon_{n_j})$ from the $j$-th day is generated from an $n_j$-variate normal distribution with mean vector $0$ and variance-covariance matrix $R_j$ of dimension $n_j \times n_j$, for $j = 1, \ldots, d$.
The covariance matrix $R_j$ has to account for the heteroscedasticity and autocorrelation of the measurements that were registered on the $j$-th day. Following the proposal of Davidian and Giltinan [25] and Xu et al. [26], we considered:

$$R_j = \sigma^2\, G_j^{1/2} H_j\, G_j^{1/2},$$

where $G_j$ and $H_j$ are the $n_j \times n_j$ matrices that account for the error variance and the autocorrelation of the measurements from the $j$-th day, respectively, for $j = 1, \ldots, d$.
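We modeled the variance as a power function by letting $\mathrm{Var}(\varepsilon_t) = \sigma^2 t^{2\delta}$ for a power $\delta \in \mathbb{R}$. The matrix $G_j$ is a diagonal matrix with diagonal elements $t^{2\delta}$ for $t = 1, \ldots, n_j$ and $j = 1, \ldots, d$. For the autocorrelation structure, we considered a first-order autoregressive model, AR(1). Under this assumption, two random errors $\varepsilon_t$ and $\varepsilon_{t+s}$ that are separated by $s$ units of time satisfy $\mathrm{cor}(\varepsilon_t, \varepsilon_{t+s}) = \rho^{|s|}$ for $t = 1, \ldots, n_j$ and $j = 1, \ldots, d$. Matrix $H_j$ is given by:

$$H_j = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{n_j - 1} \\ \rho & 1 & \rho & \cdots & \rho^{n_j - 2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{n_j - 1} & \rho^{n_j - 2} & \rho^{n_j - 3} & \cdots & 1 \end{pmatrix}.$$

Using Equation (6), the proposed NLME model is given by:

$$Y_t = \alpha_0 + \alpha_1 Z_t + \alpha_2 M_t + a_t - \log\left(1 + \left(\beta_0 + \beta_1 Z_t + \beta_2 M_t + b_t\right) \exp(-\gamma t)\right) + \varepsilon_t, \tag{8}$$

with correlated random effects $(a_t, b_t)$ and random errors $\varepsilon_j \sim N_{n_j}(0, R_j)$.

We let $\theta = (\alpha_0, \alpha_1, \alpha_2, \beta_0, \beta_1, \beta_2, \gamma, \sigma^2, \sigma_b^2, \sigma_c^2, \rho)$ represent all of the model parameters that had to be estimated from the dataset. These parameters were estimated by maximizing the likelihood function using the algorithm proposed by Lindstrom and Bates [20].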
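To make the error structure concrete, the following Python sketch builds the within-day covariance matrix from its two ingredients. It assumes the decomposition $R_j = \sigma^2 G_j^{1/2} H_j G_j^{1/2}$, with $G_j$ diagonal with entries $t^{2\delta}$ and $H_j$ the AR(1) correlation matrix with entries $\rho^{|s|}$. The function name and the parameter values in the example are hypothetical.

```python
def ar1_power_covariance(n, sigma2, delta, rho):
    """Return the n x n matrix sigma2 * G^(1/2) * H * G^(1/2) as nested lists.

    G is diagonal with entries t^(2*delta), t = 1..n (power variance function).
    H has entries rho^|i - j| (first-order autoregressive correlation).
    """
    # Square root of the diagonal of G: (t^(2*delta))^(1/2) = t^delta.
    g_sqrt = [float(t) ** delta for t in range(1, n + 1)]
    return [
        [sigma2 * g_sqrt[i] * (rho ** abs(i - j)) * g_sqrt[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical values for a day with four measurements.
example = ar1_power_covariance(4, sigma2=0.01, delta=-0.5, rho=0.9)
```

The diagonal of the result equals $\sigma^2 t^{2\delta}$, so the variance function and the correlation structure can be checked independently. In the R nlme package, the same two structures are what the `weights = varPower()` and `correlation = corAR1()` options request.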

Fitted Model
The results of fitting ten regression models to the dataset (Section 2.2) are presented in this section. The considered models are described in Table 3, where $A_t$ and $B_t$ are given by Equation (6) and $C_t = \gamma_1 Z_t + \gamma_2 M_t + c_t$ for $c_t \sim N(0, \sigma_c^2)$.
Model $M_0$ was a linear predictor and model $M_1$ was given by the log-logistic function in Equation (5). Models $M_2$, $M_3$ and $M_4$ were given by log-logistic functions with random effects in just one parameter. Models $M_5$, $M_6$ and $M_7$ were given by log-logistic functions with random effects in two parameters. Model $M_8$ was given by a log-logistic function with random effects in three parameters. Unlike the proposed model described in Equation (8), models $M_2$ to $M_8$ assumed that the random effects were independent of one another and that the random errors were not correlated. Hereafter, we call the proposed model $M_9$.
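The mean curve shared by the log-logistic candidates can be sketched as follows. The sketch assumes that the level and growth terms are linear in log-irradiance $Z$ and accumulated mass $M$, as the parameter list $(\alpha_0, \alpha_1, \alpha_2, \beta_0, \beta_1, \beta_2, \gamma)$ suggests, and it sets all random effects to zero. Function names and coefficient values are hypothetical, not the fitted estimates.

```python
import math


def log_logistic_mean(t, z, m, alpha, beta, gamma):
    """Mean of log accumulated power at time index t (random effects set to 0).

    alpha, beta: triples (intercept, coefficient of Z, coefficient of M).
    """
    a = alpha[0] + alpha[1] * z + alpha[2] * m
    b = beta[0] + beta[1] * z + beta[2] * m
    inner = 1.0 + b * math.exp(-gamma * t)
    if inner <= 0.0:
        raise ValueError("1 + B * exp(-gamma * t) must be positive")
    return a - math.log(inner)


def inflection_time(beta0, gamma):
    """Time of the inflection point of the logistic curve, log(beta0) / gamma."""
    return math.log(beta0) / gamma


# With alpha = 1 on the original scale (alpha_0 = 0), the curve reaches
# half of its asymptote at the inflection point.
t_star = inflection_time(4.0, 0.5)
half_level = math.exp(log_logistic_mean(t_star, 0.0, 0.0, (0.0, 0.0, 0.0), (4.0, 0.0, 0.0), 0.5))
```

A positive coefficient of $M$ in the growth term raises $B_t$ and therefore lowers the curve at any fixed $t$, which is the mechanism through which soiling reduces predicted power in this formulation.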
To estimate the parameters of models $M_1$ to $M_8$, we used the R software [27] and the nlme function [28]. To estimate the parameters of model $M_9$, we included the options correlation = corAR1() and weights = varPower() in the nlme call. The fitted models were compared using the mean squared error (MSE) and the AIC and BIC selection criteria. The best model is the one with the lowest values of MSE, AIC and BIC. Table 4 shows the MSE values and the comparison criteria for the ten fitted models ($M_0$ to $M_9$). The smallest values are highlighted in bold. All three criteria indicated that the proposed model ($M_9$) was the best.

Table 3. The considered models.

Table 5 shows the estimates for the fixed parameters of model $M_9$. A significance level of $\omega = 0.10$ was used for the explanatory variables irradiance and accumulated mass in this model. Since $\beta_2 > 0$ (the coefficient of variable $M$), the accumulated mass had a negative effect on the growth of the curve: as $M$ increased, the slope of the curve decreased, indicating slower growth and, consequently, a smaller amount of generated solar power. Since $\beta_1 < 0$ (the coefficient of variable $Z$), an increase in irradiance increased the slope of the curve, meaning that more solar power was generated. The fitted model follows by substituting the estimates from Table 5 into Equation (8).

Model Validation
As the best-fitting model was the NLME model $M_9$, we validated the results by comparing the predicted and recorded (observed) values using the mean squared error (MSE). The residuals of the model are also presented and discussed in this section. Figure 6 shows the observed values (• symbols) and the predicted values (red line) from days 1 and 2. The MSEs for days 1 and 2 were 0.00013 and 0.00006, respectively. Figure 7 shows the daily MSE for the first 21 days. During that period, the MSE ranged from 0.00006 to 0.00613, showing the high performance of the best-fitting model.

Figure 8b highlights the close relationship between the fitted (predicted) and observed values. The linear correlation coefficient between the predicted and observed values was 0.9997.
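The daily error metric used above is straightforward to compute. The sketch below evaluates one MSE per day from paired observed and predicted values; it assumes the values are compared on the scale of the response (the reported magnitudes suggest the log scale, but that is an inference). The function names are hypothetical.

```python
def mse(observed, predicted):
    """Mean squared error between two equal-length, non-empty sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def daily_mse(days):
    """Return one MSE per day.

    days: list of (observed, predicted) pairs, one pair of sequences per day.
    """
    return [mse(observed, predicted) for observed, predicted in days]
```

Reporting the minimum and maximum of the list returned by `daily_mse` reproduces the kind of range quoted for the training and testing periods.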
Given the high accuracy achieved by the NLME model on the samples from the first 21 days, the model was then evaluated using samples from the remaining 84 days. Figure 9 shows the daily MSEs from those 84 days, for which the values ranged from 0.00021 to 0.01756. The linear correlation coefficient between the predicted and observed values was 0.9996, indicating that the model performed well and achieved a correlation similar to that of the training period.

The environmental variables could impact the accuracy of the model. Figure 10 presents the rainfall precipitation levels and particle concentrations during the validation and evaluation period (105 days). A higher rainfall precipitation level had a direct effect on the particle concentrations in the air. This effect made it possible to relate soiling on the PV modules to the effects of natural clean-up (rainfall). It is worth noting that the high precipitation levels between November and March could explain the loss of accuracy of the model (see the MSE peaks in Figures 7 and 9).

The instability of the power generation (high and low values) throughout this period was due to cloudy conditions at the location of the PV system (mainly before and after rainfall). As rainfall became sparser (from March), the differences between the predicted and observed generated power values became larger due to an increase in soiling deposition on the PV modules.
Another way to evaluate the performance of the proposed prediction model is to look at the percentage error of the generated power. Equation (10) gives the daily percentage power difference $P_d$ between the recorded and predicted values, where $O_d = \exp\{Y_{n_d}\}$ and $E_d = \exp\{\hat{Y}_{n_d}\}$ are the recorded and predicted values on the $d$-th day, for $d = 1, \ldots, D = 105$. When $P_d < 0$, the recorded generated power on the $d$-th day is less than the predicted (expected) generated power, meaning that soiling could be impacting power generation on that day. Engineers who are responsible for PV plants should also be aware that there could be electrical failures in the PV system. A value of $P_d > 0$ means that the recorded generated power is higher than the predicted generated power. The reason for a positive difference could be that there is less soiling on the PV modules than estimated by the prediction model. This behavior could be associated with the natural clean-up of the modules in the PV array due to rainfall. PV system users should also pay attention when $P_d = 0$, as this situation does not mean that there is no soiling on the modules, but rather that the predicted and observed values are very similar. When $P_d = 0$, users should observe whether the daily power generation curve is maintaining the same steady growth or is slowing down over the given period, which could be evidence of soiling having an impact on power generation.

Figure 12 shows the percentage differences from the 105 days. For the first 25 days, negative daily $P_d$ values were observed. Between days 25 and 60, there were only positive $P_d$ values. Figure 10 suggests that these positive $P_d$ values could be associated with the high rainfall precipitation levels from November 2019 to March 2020. This behavior corroborates previous explanations that regular rainfall is responsible for the clean-up of PV modules.
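A minimal sketch of the daily percentage difference follows. It assumes that the difference between the recorded and predicted end-of-day values is normalized by the predicted value, which matches the sign convention described above (negative when the plant produces less than predicted); the exact normalization used in Equation (10) is an assumption here. The function name is hypothetical.

```python
import math


def percentage_difference(y_observed_last, y_predicted_last):
    """Daily percentage difference between recorded and predicted power.

    Both arguments are on the log scale (last value of the day), so they are
    exponentiated first. Normalizing by the predicted value is an assumption.
    """
    observed = math.exp(y_observed_last)
    expected = math.exp(y_predicted_last)
    return 100.0 * (observed - expected) / expected
```

A run of consecutive negative values from this function is the pattern that would prompt a check for soiling or for an electrical fault.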
Since the model accounted for the increase in accumulated mass over time but not for natural clean-up (from rainfall), the predicted values were below the observed values (by less than 1.4%). After the 80th day, negative values became more frequent, indicating an increase in soiling on the PV modules. The correlation coefficient found for the model was $r = 0.9999$ for the training set and $r = 0.9996$ for the testing set. The mean squared error (MSE) increased from the training set (MSE = 0.0007) to the testing set (MSE = 0.0032). The test period had more days with low and high precipitation rates, which impacted the forecasting accuracy. The fitted model was evaluated continuously over the whole period and obtained an overall MSE of 0.0027. The model could have achieved better accuracy if restart points (i.e., days on which the model restarts the measurement of the accumulated mass of particles) had been set just after days with a high precipitation rate. For example, using a minimum threshold of 30 mm of rainfall as a restart point for the accumulated mass of particles throughout the test period, the test-period MSE decreased to 0.0028 and the overall MSE to 0.0024.
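The restart rule discussed above can be sketched as a reset of the running mass after any sufficiently rainy day. The 30 mm default comes from the example in the text; the function name, the daily granularity and the choice to apply the reset from the following day onward are assumptions.

```python
def accumulate_with_restarts(daily_mass, daily_rain_mm, threshold_mm=30.0):
    """Running accumulated mass that restarts after days with heavy rainfall.

    daily_mass: mass deposited on each day (same units throughout)
    daily_rain_mm: rainfall recorded on each day, in mm
    threshold_mm: rainfall at or above which the accumulation restarts
    """
    if len(daily_mass) != len(daily_rain_mm):
        raise ValueError("inputs must have the same length")
    total = 0.0
    series = []
    for mass, rain in zip(daily_mass, daily_rain_mm):
        total += mass
        series.append(total)
        if rain >= threshold_mm:
            # Natural clean-up: accumulation starts from zero on the next day.
            total = 0.0
    return series
```

Feeding the restarted series into the predictor in place of the monotonically increasing mass is the kind of adjustment that produced the lower MSE values reported for the 30 mm threshold.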

Conclusions
This work presented a new nonlinear mixed-effects (NLME) predictor based on a log-logistic function to estimate photovoltaic power generation in solar power plants. The predictor was designed using irradiance and the accumulated mass of particulates as the explanatory (independent) variables. The response variable was the generated power at the instant of time $t$.
We evaluated ten different statistical predictors, among which the log-logistic model ($M_9$) achieved the best values according to the AIC, BIC and MSE metrics. The predictor was then validated and evaluated using a dataset comprising environmental and PV power samples that were collected across 105 days (from October 2019 to April 2020) at a photovoltaic power plant in Campo Grande-MS, Brazil. The model was trained using the first 21 days of the sample data and the testing set comprised the remaining 84 days. From a practical viewpoint, the results showed that modeling the log-transformed power using a log-logistic model with fixed and random effects was very accurate for predicting power generation in the presence of soiling. To the best of our knowledge, this is the first work to propose the statistical modeling of power generation in PV plants based on a nonlinear mixed-effects model.
The proposed predictor could be widely applied in real large-scale environments to provide accurate generated power estimates and work together with other monitoring tools to track the performance of PV systems. The hardware infrastructure of the predictor is very simple (irradiance and soiling sensors) and most PV plants have standard weather or solarimetric stations from which irradiance and other environmental data are available.
It can be seen that the predictor uses accumulative input data to estimate power generation. During different events at the PV power plant site (e.g., maintenance procedures) or even under certain environmental conditions (e.g., high rainfall precipitation levels), the input variables may need to be restarted to cope with the new state of the PV system performance. Another useful advantage of the proposed predictor system is that users may obtain power generation estimates for short time intervals (hourly power generation) or large time intervals, for instance, at the end of the day (daily power generation).
The predictor behavior provided insights into the best time to perform clean-up procedures on PV arrays. A correct cleaning schedule for PV modules avoids unnecessary maintenance costs and maximizes profits by allowing the PV system to work at full capacity.
Some opportunities for future work include applying the model to a longer period in order to evaluate the power estimates under different weather conditions and seasons. Another subject for further research is the analysis of the thresholds for the training and testing sets to avoid model overfitting. Other targets for future work include applying the model to photovoltaic power plants with different module technologies or different soiling detection sources, such as images from cameras and satellites, and the use of machine learning models to estimate soiling on PV arrays.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data and information that support the findings of this article are freely available at https://github.com/lscad-facom-ufms/Solar2 (accessed on 6 April 2022).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
PM1.0    Particles less than 1.0 µm in diameter
PM2.5    Particles less than 2.5 µm in diameter
PM4.0    Particles less than 4.0 µm in diameter
PM10.0   Particles less than 10.0 µm in diameter
PRM      Polynomial regression model