Statistics to Detect Low-Intensity Anomalies in PV Systems

The aim of this paper is to monitor the energy performance of Photovoltaic (PV) plants in order to detect low-intensity anomalies before they become failures or faults. The approach is based on several statistical tools, which are applied iteratively as the data are acquired. At every loop, new data are added to the previous ones and the proposed procedure is applied to the new dataset, so the analysis is carried out on cumulative data. In this way, it is possible to track some specific parameters and to check that identical arrays in the same operating conditions produce the same energy. The procedure is based on parametric (ANOVA) and non-parametric tests, and proves effective in locating anomalies. Three cumulative case studies, based on a real operating PV plant, are analyzed.


Introduction
A crucial problem for the management of PV plants is the strong dependence of the system response on many extrinsic factors, such as irradiance intensity, cloudiness and pollution, humidity, air velocity, environment temperature, and cell temperature. Several models that are able to evaluate the effects of such uncertainties have been proposed in [1-9], whereas statistical methods have been used for the design of PV plants [10,11] and for the suitable choice of the electrical components [12,13]. When a PV plant is operating, a monitoring system is needed to check the performance in all environmental conditions.
PV modules are the main components of a PV plant, so close attention has to be paid to their state of health. To this aim, the techniques commonly used to verify the presence of typical defects in PV modules are based on infrared analysis [14,15], on luminescence imaging [16], and on their combination [17], while automatic procedures to extract information from the thermograms are proposed in [18,19]. Nevertheless, these approaches regard single modules of PV plants and are useful once a defect has been roughly localized. When there is no information about the general operation of the PV plant, other techniques have to be considered, such as neural networks [20], checks of the electrical variables [21], AutoRegressive Integrated Moving Average (ARIMA) models [22], statistics, and so on. For example, in [23], the authors proposed a decision algorithm based on both descriptive and inferential statistics. The former is used to characterize the dataset under investigation on the basis of a descriptive model or a distribution family. The latter is used for unknown datasets and models a data-producing process, with the aim of inferring the characteristics of the population from a subset of the sample data, in order to predict anomalies even when the amount of field measurements is limited. Nevertheless, when important failures/anomalies (short circuit, islanding, and so on) occur, the electrical variables and the associated energy show fast and non-negligible variations, so they are detected instantaneously. These events can be classified as high-intensity anomalies, because they produce a drastic change. Instead, low-intensity anomalies (ageing of the components, minimal partial shading, corrosion, and so on) produce a minimal variation in the values of the electrical variables and of the produced energy, so detecting them is not trivial. Moreover, these anomalies can evolve into failures or faults, so a fast and correct identification can prevent them and limit the downtime.
This paper proposes a cheap, easy, and fast statistical methodology based on an algorithm that processes the data usually stored in the datalogger (Easy, Fronius, Bussolengo (VR), Italy) of the PV plants, therefore not requiring new hardware/components.
The proposed methodology first checks for the presence of high-intensity anomalies in each array of the PV plant, then discriminates the statistical distributions between unimodal and multimodal, and finally applies several statistical tools to draw conclusions about the similarity among the energy produced by the arrays, in order to detect and locate possible anomalies.
The paper presents the methodology in Section 2, describes the characteristics of a real PV plant under investigation in Section 3, and discusses the results in Section 4. Conclusions end the paper.

Statistics-Based Procedure
The random variability of the environmental conditions affects the irradiance intensity for the PV generators. During clear days an analytic expression for solar irradiance can be defined, whereas this is not possible for cloudy days, when a statistical approach proves very effective instead. In this paper, we consider that the PV plant is composed of k identical arrays with a dedicated measurement unit for each array. Each unit stores measurements of voltage, current, and energy produced, whereas a central unit acquires the data coming from all the arrays. For our aims, let us consider the dataset constituted by the energy values. The descriptive statistical parameters (mean, median, and variance) are valuable tools to detect large failures (short circuits, open circuits, large shading, etc.) that provoke non-negligible energy reductions. Instead, when a low-intensity anomaly appears in an array (light ageing, partial shading, and so on), the energy reduction is limited, and the previous parameters are not able to distinguish an anomaly from an acceptable tolerance. To this aim, we propose here the methodology reported in Figure 1, based on the comparison among the energy performances of the arrays. This goal can be pursued by means of parametric or non-parametric tests. When the data distribution is not normal, a non-parametric test must be used. As the non-parametric tests make only mild assumptions, they are less powerful than the parametric ones for normally distributed data. So, it is preferable to use a parametric test whenever possible. Among the parametric tests, the Analysis of Variance (ANOVA) [24] is an effective tool for inferential purposes and returns reliable feedback by comparing the variance inside each array distribution with the variance among the arrays' distributions. In other words, ANOVA evaluates whether the differences among the mean values of the different groups are statistically significant or not. So, the null hypothesis H_0 is that the means µ_i are equal, i.e.,
$$H_0: \mu_1 = \mu_2 = \dots = \mu_k \qquad (1)$$
versus the alternative one that at least one distribution has a mean different from the others. Nevertheless, ANOVA can be applied only under three assumptions: (a) equal variance for all the distributions; (b) all the distributions are gaussian; and, (c) all of the observations are independent of each other.
In our case, assumption (c) is always verified, because the k local measurement units are different; moreover, modest violations of assumptions (a) and (b) are tolerated, because ANOVA is a robust test.
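As an illustration of how ANOVA compares the variance among the arrays with the variance inside each array, the following minimal sketch computes the F statistic of a one-way ANOVA using only the Python standard library. The function name and the sample values are hypothetical and are not taken from the plant data; the p-value, which requires the F distribution, is left out on purpose.

```python
from statistics import fmean

def anova_f_statistic(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of energy values per array.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    group_means = [fmean(g) for g in groups]

    # Variability among the group means (between-groups sum of squares).
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability inside each group (within-groups sum of squares).
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical values, chosen only to make the arithmetic easy to follow.
f_value, df_b, df_w = anova_f_statistic([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f_value, df_b, df_w)
```

A large F value means that the spread of the group means is large compared with the spread inside the groups, which is the evidence against H_0.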
So, before applying the ANOVA test, several verifications are needed. The first one regards the unimodality of the statistical distributions; in fact, a multimodal distribution (Figure 2) is surely not normally distributed and thus violates condition (b). The check on unimodality can be done by means of skewness and kurtosis, as will be explained later. The skewness of a distribution is defined as:
$$\sigma_k = \frac{E(x-\mu)^3}{\sigma^3} \qquad (2)$$
where µ is the mean of the data x, σ is the standard deviation of x, and E(z) is the expected value of the quantity z. From a mathematical point of view, the skewness is the third standardized moment and it measures the asymmetry of the data around the mean. For:
• $\sigma_k = 0$, the data show no net asymmetry around the mean, as happens for a gaussian distribution;
• $\sigma_k > 0$, the data are spread out more to the right of the mean than to the left; and,
• $\sigma_k < 0$, the data are spread out more to the left.
Note that $\sigma_k = 0$ is a necessary but not sufficient condition for symmetry; in fact, symmetric distributions with $\sigma_k \neq 0$ cannot exist, but asymmetric distributions with $\sigma_k = 0$ can.
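The sign convention of the skewness can be checked with a short sketch that evaluates the third standardized moment on small samples. The function name and the sample values are illustrative assumptions, and the population standard deviation is used for σ.

```python
from statistics import fmean, pstdev

def skewness(data):
    """Third standardized moment: E[(x - mu)^3] / sigma^3."""
    mu = fmean(data)
    sigma = pstdev(data)  # population standard deviation
    return fmean([((x - mu) / sigma) ** 3 for x in data])

# Hypothetical samples, not plant measurements.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 1, 1, 2, 10]   # a few large values stretch the right side
left_tail = [-x for x in right_tail]

print(skewness(symmetric), skewness(right_tail), skewness(left_tail))
```

The symmetric sample gives zero skewness, the sample with a long right tail gives a positive value, and its mirror image gives the opposite value.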
The kurtosis, instead, is a measure of how outlier-prone a distribution is. Since several definitions of kurtosis exist, here we consider Pearson's kurtosis less 3, i.e.,
$$k_u = \frac{E(x-\mu)^4}{\sigma^4} - 3 \qquad (3)$$
From a mathematical point of view, the kurtosis is the fourth standardized moment, and it results that for:
• $k_u = 0$, the distribution is as outlier-prone as the gaussian one;
• $k_u < 0$, the distribution is less outlier-prone than the gaussian one; and,
• $k_u > 0$, the distribution is more outlier-prone than the gaussian one.
Now, skewness and kurtosis of a unimodal distribution have to satisfy the following constraint [25]:
$$U = \sigma_k^2 - k_u \le \frac{6}{5} \qquad (4)$$
when the distribution mode is equal to the distribution mean and the kurtosis is calculated as in (3). Instead, a more inclusive constraint, valid when the distribution mode is different from the distribution mean, is [26]:
$$U^* = \sigma_k^2 - k_u \le \frac{186}{125} \qquad (5)$$
So, after the calculation of skewness and kurtosis, and of the mode and mean values of each distribution, if the constraint (4) (when mode = mean) or (5) (when mode ≠ mean) is not satisfied, a non-parametric test has to be applied, i.e., the Kruskal-Wallis test (K-W) [27,28] or Mood's median test (MM), which requires only that the measurements come from a continuous distribution, even though it is not gaussian. Moreover, MM is preferred to K-W when outliers are present.
Instead, K-W performs an analysis of variance on the ranks of the data values rather than on the data values themselves. K-W is based on the same null hypothesis (1), i.e., that the data belong to k distributions having equal mean values. So, if p-value < α = 0.05, the null hypothesis (1) can be rejected, implying that the k arrays are not producing the same amount of energy: a low-intensity anomaly is present and an alert is issued, highlighting which arrays are performing worst. The parameter α is the well-known significance level.
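The rank-based comparison can be sketched as follows. This is a minimal standard-library implementation of the textbook K-W statistic with tie correction and a chi-square approximation of the p-value, not the Matlab routine used for the analyses; all function names and the sample values are assumptions made for illustration.

```python
import math

def average_ranks(values):
    """Rank values from 1 to N; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            ranks[idx] = (i + j) / 2 + 1
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    t = x / 2
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))
    while a < df / 2:  # recurrence Q(a + 1, t) = Q(a, t) + t^a e^-t / Gamma(a + 1)
        q += t ** a * math.exp(-t) / math.gamma(a + 1)
        a += 1
    return q

def kruskal_wallis(groups, alpha=0.05):
    """Return (H, p_value, reject_H0) for a list of groups."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    if correction == 0:  # all values identical: no evidence against H0
        return 0.0, 1.0, False
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = (12 / (n * (n + 1)) * h - 3 * (n + 1)) / correction
    p_value = chi2_sf(h, len(groups) - 1)
    return h, p_value, p_value < alpha

# Hypothetical, clearly separated groups: H0 is expected to be rejected.
print(kruskal_wallis([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

With k = 6 arrays, the statistic would be compared with a chi-square distribution with 5 degrees of freedom, which the helper above also handles.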
Instead, if the constraint (4) (when mode = mean) or (5) (when mode ≠ mean) is satisfied, ANOVA can be applied, provided that conditions (a) and (b) hold. The check on the variances (condition (a)) can be made by means of the homoscedasticity test (also known as the homogeneity of variance test), while condition (b) can be verified by means of a normal probability plot [29]. The normal probability plot returns the range of percentiles where the distribution is gaussian. If the above assumptions are verified, then it is possible to apply the ANOVA test to obtain the p-value; otherwise, a non-parametric test (K-W or MM) has to be applied.
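The selection path described so far can be summarized in a short sketch. It assumes that the unimodality index is the squared skewness minus the kurtosis less 3, compared with the bounds 6/5 (mode equal to mean) and 186/125 (mode different from mean); the outcomes of the homoscedasticity test, of the normal probability plot, and of the outlier inspection are passed in as boolean flags, which is a simplification introduced here. All names are hypothetical.

```python
from statistics import fmean, pstdev

BOUND_MODE_EQ_MEAN = 6 / 5      # bound used when mode and mean coincide
BOUND_MODE_NE_MEAN = 186 / 125  # more inclusive bound, mode different from mean

def standardized_moment(data, order):
    mu, sigma = fmean(data), pstdev(data)
    return fmean([((x - mu) / sigma) ** order for x in data])

def unimodality_index(data):
    """Squared skewness minus excess kurtosis."""
    skew = standardized_moment(data, 3)
    excess_kurt = standardized_moment(data, 4) - 3
    return skew ** 2 - excess_kurt

def is_unimodal(data, mode_equals_mean):
    bound = BOUND_MODE_EQ_MEAN if mode_equals_mean else BOUND_MODE_NE_MEAN
    return unimodality_index(data) <= bound

def choose_test(arrays, mode_equals_mean, equal_variances, gaussian, has_outliers):
    """Return 'ANOVA', 'K-W' or 'MM' following the path described in the text.

    arrays: one list of energy values per array.
    mode_equals_mean: one boolean per array.
    equal_variances, gaussian, has_outliers: results of checks done elsewhere.
    """
    unimodal = all(is_unimodal(d, m) for d, m in zip(arrays, mode_equals_mean))
    if unimodal and equal_variances and gaussian:
        return "ANOVA"
    return "MM" if has_outliers else "K-W"

# Hypothetical samples: one bell-shaped, one with two separated peaks.
bell = [1, 2, 2, 3, 3, 3, 4, 4, 5]
two_peaks = [0] * 50 + [10] * 50
print(unimodality_index(bell), unimodality_index(two_peaks))
print(choose_test([bell, two_peaks], [True, False], True, True, False))
```

The two-peak sample exceeds both bounds, so the sketch falls back to a non-parametric test even when the other flags would allow ANOVA.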
The final step is the analysis of all the calculated statistical parameters. Support is provided by the Pearson distribution, which we have used as the approximating function because we observed that the kurtosis is necessary to fit the data. In particular, the proposed distribution is a Pearson Type IV (p4PDF), which is an asymmetric version of the Student's t distribution. By considering the first four standardized moments, i.e., mean µ, standard deviation (STD) σ, skewness $\sigma_k$, and kurtosis $k_u$, the p4PDF can be written as in (6), where the data are standardized as $(x - \mu)/\sigma$ and the remaining quantities are given in (7) and (8). In (7), n, κ, a, and λ are real-valued parameters, and −∞ < x < ∞. k_2 is a normalization constant that depends on n, κ, and a, and it can be expressed as in (9) [30].
As new data are acquired, the estimation of the fault and of its location becomes more accurate. The proposed algorithm extracts statistical information from the energy produced by the arrays and performs a continuous supervision of the operation of the PV plant.
Since in real cases the data distribution is unlikely to be perfectly gaussian, and since the ANOVA test tolerates a modest violation of condition (b), skewness and kurtosis, already defined in (2) and (3), can be used to quantify the divergence of the real distributions of the k arrays from a gaussian one and to select the most suitable path.

Case Study
The behavior of a 19.8 kWp grid-connected PV plant, located in the South of Italy, has been analyzed. The 132 panels of the plant are partitioned into k = 6 equal arrays. Each PV module (Sol 150 mono-crystalline, Solterra, Chiasso, Switzerland) has a nominal power of 150 Wp, so the peak power of each array is 3300 Wp. A 3000 W inverter (Sunny Boy 3000, SMA, Milano, Italy) is connected to each array. The system faces south and is sloped at about 44°. The PV plant is on the roof of a private company building that is taller than any surrounding obstacle. Each inverter is connected to the grid only if the PV voltage of its array exceeds a prefixed threshold, which is checked by the internal regulation system of the inverter; this system is also able to capture the MPP voltage.
The PV plant has a data acquisition system, constituted by a datalogger that acquires the data from the six inverters every 2 s. Internal software calculates the average value of the sampled data every 10 min and stores only this value in the database. The daily and cumulative values calculated and stored in the database are: (a) power and energy on the AC side of each inverter; (b) voltage V_dc on the DC side of each inverter; and, (c) total number of working hours. The monitoring system can store up to 400 days of data. The investigation period refers to a full year, during which the plant has shown several anomalies.
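The reduction from 2 s samples to 10 min averages can be illustrated with a minimal sketch. It is not the datalogger firmware: the function name is hypothetical, and discarding an incomplete final window is an assumption made here for simplicity.

```python
from statistics import fmean

SAMPLE_PERIOD_S = 2                                 # one sample every 2 s
WINDOW_S = 10 * 60                                  # averaging window of 10 min
SAMPLES_PER_WINDOW = WINDOW_S // SAMPLE_PERIOD_S    # 300 samples per stored value

def ten_minute_averages(samples):
    """Average consecutive blocks of 300 samples; drop an incomplete tail."""
    full = len(samples) - len(samples) % SAMPLES_PER_WINDOW
    return [fmean(samples[i:i + SAMPLES_PER_WINDOW])
            for i in range(0, full, SAMPLES_PER_WINDOW)]

# Hypothetical power samples: two complete windows and 10 leftover samples.
samples = [1.0] * 300 + [3.0] * 300 + [5.0] * 10
print(ten_minute_averages(samples))
```

Each stored value therefore summarizes 300 raw samples, which is why the database can cover up to 400 days.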

Cumulative Statistical Analysis
To analyze the energy performance of the PV plant described in Section 3, the statistics-based algorithm introduced in Section 2 has been used. All of the analyses have been run in the Matlab (R2017, MathWorks, Natick, MA, USA) environment by using standard routines and implementing the proposed algorithm.
Several analyses are presented, corresponding to several cycles of the algorithm of Figure 1, as the new data were acquired. So, for each new case study, new samples are added to the previous ones. It should be highlighted that the data have been filtered before inserting them in the proposed routine. The pre-processing is always needed, because some bad data or outliers could be present.
Thus, the following analyses represent a cumulative analysis for the statistical monitoring of the PV plant, in order to follow the time behavior of some benchmarks and to detect the low-intensity anomalies. Several analyses will be presented:
• monthly analysis (January);
• quarterly analysis (January-March); and,
• yearly analysis (January-December).
The following results will be reported:
• mean, median, variance and relative spreads of each array, in order to verify whether any large failure is present;
• skewness and kurtosis values, in order to evaluate the unimodality U or U* of the k distributions, and also to quantify the mismatches with respect to a gaussian distribution; and,
• p-value, as explained in Section 2, having fixed α = 0.05.
Observing Figure 1, the three analyses can be considered as the cumulative result of the flow-chart, starting from the data of one month and adding, firstly, the new data of the following two months (producing the quarterly analysis), and then the new data of the following nine months (producing the yearly analysis). The procedure described here has been applied to the energy dataset of the same PV plant for two years. In this paper, we discuss only the application to the energy dataset of the second year, highlighting that in the first year no anomaly was revealed, the PV plant produced the expected energy, and the arrays produced about the same energy.
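The cumulative scheme can be sketched as a loop that re-runs the same analysis on growing portions of the dataset. The function, the dictionary layout, and the checkpoint lengths (31, 90 and 365 days, i.e., a non-leap year) are assumptions introduced for illustration; `analyse` stands for any routine implementing the flow-chart of Figure 1.

```python
def cumulative_analyses(daily_energy, checkpoints, analyse):
    """Apply `analyse` to the first n days of data for each checkpoint.

    daily_energy: dict mapping array id -> list of daily energy values (kWh).
    checkpoints: dict mapping a label -> number of days included so far.
    analyse: callable receiving the truncated dataset and returning a result.
    """
    results = {}
    for label, n_days in checkpoints.items():
        subset = {array_id: values[:n_days]
                  for array_id, values in daily_energy.items()}
        results[label] = analyse(subset)
    return results

# Hypothetical dataset: six arrays, one placeholder value per day of the year.
daily_energy = {array_id: [1.0] * 365 for array_id in range(1, 7)}
checkpoints = {"monthly": 31, "quarterly": 90, "yearly": 365}

# Placeholder analysis that only counts the days seen by each array.
summary = cumulative_analyses(
    daily_energy, checkpoints,
    lambda subset: {a: len(v) for a, v in subset.items()})
print(summary["monthly"][1], summary["quarterly"][1], summary["yearly"][1])
```

Each later checkpoint contains all the data of the earlier ones, so a latent anomaly accumulates evidence as the window grows.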

Monthly Analysis (January)
Figure 3 reports the energy produced by each array, while Table 1 reports some statistical parameters (mean, variance, and median of the energy produced by each array, their global means, and the spreads in per cent). The spreads of the means are limited (the failure of one of the 22 PV modules of an array would imply an energy reduction of about 5%, so no PV module is out of order), so large failures are not present. In order to evaluate the presence of a low-intensity anomaly, let us apply the flow-chart of Figure 1. To evaluate the unimodality, mode, skewness, and kurtosis are calculated (see Table 1) and are used in (5), because mean and mode are different for each array. For each array, the values of U* (Table 1) are less than 186/125 = 1.488, thus the constraint (5) is satisfied and each distribution is unimodal. Now, the ANOVA constraints would have to be verified by means of the homoscedasticity test and the normal probability plot; nevertheless, it can be observed that the values of skewness and kurtosis are very different from zero for each array, so the distributions are surely not gaussian, violating condition (b) of the ANOVA test. Moreover, the spreads of the variances (contained in the range −2.81 ÷ 3.46) indicate a non-negligible violation of condition (a) as well. Therefore, ANOVA cannot be applied, and K-W is chosen, because outliers are not present (see Figure 3). Table 1 also reports the p-value of K-W (0.9999), which implies that:
• p-value > 0.05, so the null hypothesis H_0 in (1) cannot be rejected; and,
• 1 − p-value < 0.05, so the alternative hypothesis that at least one distribution has a mean different from the other ones has to be rejected.
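The screening for large failures can be illustrated by a minimal sketch that computes the spreads in per cent. It assumes that a spread is the relative deviation of an array value from the global mean of all arrays, and it uses the share of one module out of 22 (about 4.5%) only as a coarse reference; the function names and the sample values are hypothetical and are not those of Table 1.

```python
from statistics import fmean

MODULES_PER_ARRAY = 22
# Energy share of a single module, used here as a coarse reference (about 4.5 %).
SINGLE_MODULE_SHARE = 100 / MODULES_PER_ARRAY

def percent_spreads(values):
    """Relative deviation of each value from the global mean, in per cent."""
    global_mean = fmean(values)
    return [100 * (v - global_mean) / global_mean for v in values]

def arrays_beyond_reference(values, reference=SINGLE_MODULE_SHARE):
    """Return the 1-based indices of the arrays whose spread is below -reference."""
    return [i + 1 for i, s in enumerate(percent_spreads(values))
            if s < -reference]

# Hypothetical mean daily energies (kWh) of six arrays.
healthy = [10.0, 10.1, 9.9, 10.0, 10.0, 10.0]
degraded = [10.0, 10.0, 10.0, 10.0, 9.0, 10.0]
print(arrays_beyond_reference(healthy))    # no array beyond the reference
print(arrays_beyond_reference(degraded))   # array 5 stands out
```

Spreads well inside the reference, as in the first sample, are exactly the situation in which descriptive statistics cannot separate a low-intensity anomaly from tolerance, and the inferential tests are needed.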

Discussion
The proposed procedure has highlighted that the energy performances of the six arrays are similar for the first three months; however, this is not true for the whole year, so the proposed cumulative analysis has highlighted a low-intensity anomaly. The procedure is not able to identify the origin of the anomaly, but it detects and locates it. In fact, an in-depth view of the spreads of the means reported in Table 3 highlights the anomaly in array n. 5 (spread of the mean equal to −1.79%). While the first two analyses did not reveal critical spreads (the p-value was high), the yearly analysis did. Evidently, the anomaly was initially latent, and it has become detectable over time. This shows the effectiveness of the cumulative application of the proposed procedure. Again, array n. 1 is always the best performing, whereas the other ones have an oscillating behavior. These considerations are also confirmed by the spreads of the median, which always assign the maximum positive value to array n. 1 and the maximum negative value to array n. 5.
Figure 4 reports the probability density function computed considering a Pearson distribution, in order to take into account all four moments (mean, variance, skewness, and kurtosis), for the three investigated periods: 1 month, 3 months, and 12 months. Two main observations are possible. Firstly, all the curves in Figure 4a,b (monthly and quarterly, respectively) are spread out on the right, whereas all of the curves in Figure 4c (yearly) are spread out on the left. Therefore, although the environmental conditions vary during the year, they affect all the arrays of the PV plant equally, and the distribution functions (see Figure 4a-c) of all of the arrays are very similar for each investigated period, highlighting that no fault or high-intensity anomaly has occurred in that year. Secondly, the curves in Figure 4a,b are quite close to each other, thus highlighting the similarity among the arrays from the energy point of view. Instead, the curves in Figure 4c are more widely spaced, highlighting that the energy performances of the arrays over the whole year are different, even if they were similar for the first three months. Hence, a low-intensity anomaly is present, and the previous numerical analysis has also allowed for locating it. In conclusion, the diagram of the Pearson distribution is useful to get preliminary and qualitative information on the arrays if their distributions are unimodal, while the procedure of Figure 1 allows for identifying the modality of each distribution, quantifying some statistics for each array, and comparing the energy performances with each other, such that possible low-intensity anomalies are detected and located.

Conclusions
The paper proposes a procedure for the statistical analysis of the energy performance of PV plants, in order to detect the presence of low-intensity anomalies. The procedure is cumulative, and some statistical benchmarks are recalculated as new measurements are acquired. Experimental results on the yearly data acquired by the datalogger of a real PV plant are shown for three analyses, with the aim of explaining step by step the iterative procedure for a complete year; nevertheless, the proposed approach is also useful for real-time monitoring, if rigorous performance benchmarks are fixed. In this way, the trend of the benchmarks can be followed and the anomalies can be detected before they become faults.
If the benchmarks exceed prefixed thresholds, alert messages can be sent, or a control procedure can be implemented. Obviously, the number of applications of the cumulative analysis needed to detect an anomaly depends on its severity. The real case study has shown the effectiveness of the proposed approach. We observed that the cumulative analysis of the whole period (12 months) gives information about the low-intensity anomaly of array n. 5, as highlighted by the data of Table 3. Therefore, the method shows that a low-intensity anomaly can be detected and located, although it is not possible to understand its origin. Finally, by storing the statistical parameters of each run of the proposed algorithm, they can be compared year by year (or for the same month/quarter of successive years), in order to track the statistical parameters of each single array.

Figure 2. Different modality.

Figure 3. Energy, in kWh, produced by each array during one month.

Table 1. Mean, median and variance of the daily energy (in kWh) produced by each array and relative spreads for one month. Skewness, kurtosis, and p-value.