1. Introduction
The random variability of atmospheric phenomena affects the irradiance available to PhotoVoltaic (PV) generators. During clear days an analytic expression for the solar irradiance can be defined, whereas this is not possible for cloudy days. Extreme-case conditions are usually assumed as a reference for verification purposes, neglecting the inherently random nature of some aspects affecting the electrical characteristics of a PV system. Several models able to take into account the effects of the environmental conditions have been proposed in [1,2,3]. When a PV plant is operating, a monitoring system is needed to check its performance under all environmental conditions. PV modules are the main components of a PV plant, so particular attention has to be paid to their state of health [4]. To this aim, the techniques commonly used to verify the presence of typical defects in PV modules are based on infrared and/or luminescence analysis [5,6], while automatic procedures to extract information from the thermograms are proposed in [7,8]. Nevertheless, these approaches regard single modules of PV plants and are useful only once a defect has already been roughly localized. When there is no information about the general operation of the PV plant, other techniques have to be considered to predict failures and to enhance the PV system performance, such as neural networks [9] or electrical models [10]. Some authors propose statistics-based approaches [11,12,13]. Other researchers evaluate the presence of faults by monitoring the electrical signals [14,15]. Predictive model approaches for PV system power production, based on the comparison between measured and modeled PV system outputs, are instead discussed in [16,17]. For example, the international standard [18] defines some indices (final PV system yield, reference yield, Performance Ratio) that are commonly used to monitor the performance of a PV plant with respect to the energy production, the solar resource, and the system losses. These indices have been used in two interesting recent studies on the energy performance of PV plants: the first focused on 993 residential PV systems in Belgium [19], whereas the second studied 6868 PV plants in France [20]. Unfortunately, these indices show two critical points: (a) they supply information about the performance of the PV plant as a whole; and (b) no feedback is returned about the correct operation of its single parts. Moreover, these monitoring approaches are based on environmental parameters that are not always available, above all for small-medium size PV plants.
When an important fault such as a short circuit or islanding occurs, the electrical variables and the energy undergo fast and non-negligible variations, so these events are easily detected. They produce drastic changes and can be classified as high-intensity anomalies. Low-intensity anomalies, instead, such as ageing of the components or minimal partial shading, produce minimal variations in the values of the electrical variables and in the produced energy, so detecting them is not trivial. Moreover, these anomalies can evolve into failures or faults, so their fast and correct identification can prevent them and limit the downtime. This paper proposes a cheap and fast statistical methodology based on an algorithm that processes the data usually acquired by the measurement units of PV plants; therefore, it does not require additional hardware/components. The issue of detecting low-intensity anomalies by means of statistical tools has been addressed in [12], where the check on the unimodality of the energy distributions is based on the values of skewness and kurtosis. These statistical moments are also used to check whether the distributions are normal or not, whereas the check on the variances of the energy distributions is based on homoscedasticity. In the next section, all of these critical points will be discussed in detail.
The proposed methodology is devoted to small-medium size PV plants constituted by several arrays and does not require environmental data, such as solar irradiance or cell temperature. It analyzes the dataset of the energy produced by each array and extracts the features of the corresponding statistical distributions, in order to choose the best-performing statistical tool. Depending on the modality (single or multiple) of the distributions and on other statistical parameters, a parametric or a non-parametric test is used to evaluate whether identical arrays, in the same (unknown) environmental conditions, produce the same energy. The procedure is cumulative: new data are added to the initial dataset as they are stored on the datalogger. The case study presented in this paper regards a real operating PV plant, and three applications of the methodology will be discussed: the first based on the energy dataset of one month; the second on the energy dataset of six months; the last on the energy dataset of one year. The monitoring of the statistical parameters and of their mismatches with respect to the benchmarks allows detecting and locating possible anomalies before they become failures.
The paper is structured as follows: Section 2 introduces the statistical algorithm, Section 3 describes the PV plant under investigation, Section 4 presents the results of the cumulative statistical analysis, and the Conclusions end the paper.
2. Statistical Methodology
In this paper, we consider a PV plant composed of N identical arrays, each of them equipped with a measurement unit, which measures the values of voltage and current on both the DC and AC sides of the inverter, as well as the energy produced by the array, with a fixed sampling time ∆t. At the generic time-instant t = k∆t of the d-th day, the k-th sample vector of the n-th array is defined as s_n(k, d), for k = 1, …, K, d = 1, …, D (D being the number of investigated days), and n = 1, …, N, where k = 1 characterizes the first daily sample, acquired at the time t = ∆t, and k = K defines the last daily sample, acquired at the time t = K∆t.

For our aims, let us consider the dataset constituted by the energy values E_n(k, d). The n-th array, at the end of the d-th day, has produced the energy E_n(d) = E_n(K, d); therefore, the complete dataset of the energy produced by the PV plant in the whole investigated period can be represented in matrix form:

$$\mathbf{E} = \begin{bmatrix} E_1(1) & E_2(1) & \cdots & E_N(1) \\ \vdots & \vdots & \ddots & \vdots \\ E_1(D) & E_2(D) & \cdots & E_N(D) \end{bmatrix} \qquad (1)$$
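The assembly of matrix (1) can be sketched as follows. This is an illustrative sketch, not the plant's actual acquisition code: the array shapes, the simulated cumulative-energy samples, and the variable names are assumptions, while the rule "daily production = last cumulative sample of the day, E_n(K, d)" follows the definitions above.

```python
import numpy as np

# Illustrative dimensions (not from the paper): D days, N arrays,
# K samples per day (e.g., 15-min sampling over 24 h).
D, N, K = 3, 4, 96
rng = np.random.default_rng(0)

# Simulated cumulative energy profiles, shape (N, D, K):
# non-decreasing within each day, as a cumulative energy counter would be.
samples = np.cumsum(rng.uniform(0.0, 0.5, size=(N, D, K)), axis=2)

# Matrix (1): rows are days, columns are arrays.  The daily energy of
# array n on day d is its last sample of that day, E_n(K, d).
E = samples[:, :, -1].T          # shape (D, N)

print(E.shape)                   # (3, 4)
```

Each column of `E` then collects the daily productions of one array and can be treated as a separate statistical population, as discussed next.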
The columns of the matrix (1) are independent of each other, because the values of each array are acquired by dedicated acquisition units, so no inter-dependence exists among the values of the columns, which can therefore be considered as separate statistical populations. The flow chart in Figure 1 proposes a methodology to verify the energy performance of a PV plant and to detect and locate any anomaly before it becomes a failure. The methodology is particularly useful when the PV plant is not equipped with a weather station, because, in this case, the evaluations of the energy performance cannot be carried out with respect to the solar radiation, cell temperature, and so on.

The absence of a weather station is very frequent for PV systems with nominal peak power up to 100 kWp. Hence, the proposed methodology allows supervising the energy performance of a PV plant on the basis of a mutual comparison among its arrays, with no environmental data as input. Obviously, this approach is valid only if the arrays are identical (same PV modules, same number of modules, same slope, same tilt, same inverters, and so on); in fact, under this assumption, the energy productions of the arrays must be almost identical in each period and over the whole year (the changing environmental conditions affect the arrays in the same way, if they are installed next to each other). The comparative and cumulative monitoring of the energy performance of identical arrays thus allows declaring, within the uncertainty defined by the value of the significance level α, whether the arrays are producing the same energy or not. This goal can be pursued by using either parametric or non-parametric tests. Since the parametric tests are based on a known distribution, they are more reliable than the non-parametric ones, which are, instead, distribution-free. For this reason, it is advisable to always use the parametric tests, provided that all of the needed assumptions are satisfied. In particular, the parametric test known as ANalysis Of VAriance (ANOVA) [21] compares the variance inside each array's distribution with the variance among the arrays' distributions. In other words, ANOVA evaluates whether the differences among the mean values of the different groups are statistically significant or not. ANOVA is based on the null hypothesis H0 (Equation (2)) that the means of the distributions, µ1, µ2, …, µN, are equal:

H0 : µ1 = µ2 = … = µN (2)
versus the alternative hypothesis that the mean value of at least one distribution is different from the others. Rigorously, ANOVA can be used under the following assumptions:
- (a) all the observations are mutually independent;
- (b) all the distributions have equal variance; and
- (c) all the distributions are normally distributed.
Nevertheless, ANOVA can be applied even under limited violations of assumptions (b) and (c), whereas assumption (a) is always verified if the measures come from independent local measurement units.
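The ANOVA comparison of Equation (2) can be sketched with SciPy's one-way ANOVA. The daily-energy values below are synthetic (in the methodology they would be the columns of matrix (1)); the 10%-loss figure for the degraded array is purely illustrative.

```python
import numpy as np
from scipy import stats

# Three identical arrays: synthetic daily energies drawn from the same
# distribution (stand-ins for three columns of matrix (1)).
rng = np.random.default_rng(1)
healthy = [rng.normal(loc=42.0, scale=3.0, size=180) for _ in range(3)]

f_stat, p_healthy = stats.f_oneway(*healthy)
print(f"identical arrays: p = {p_healthy:.3f}")

# Simulate one array producing roughly 10% less energy: the mean shift
# drives the p-value far below the significance level alpha = 0.05.
degraded = [healthy[0], healthy[1], healthy[2] - 4.0]
f_bad, p_bad = stats.f_oneway(*degraded)
print(f"one degraded array: p = {p_bad:.3g}")
```

Rejecting H0 here does not by itself locate the anomalous array; in the methodology, the monitoring of the per-array statistical parameters provides that localization.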
So, before applying the ANOVA test, several verifications are needed; they correspond to the blue blocks of Figure 1. The first check regards the unimodality of the dataset of each array, because a multimodal distribution, e.g., the bimodal distribution in Figure 2, is surely non-normal. The Hartigan's Dip Test (HDT) is able to check the unimodality [22] and is based on the null hypothesis that the distribution is unimodal versus the alternative that it is at least bi-modal.

Usually, the hypothesis test is defined with the significance level α = 0.05; so, if p-value < α, then the null hypothesis is rejected, considering it acceptable to have a 5% probability of incorrectly rejecting the null hypothesis (this is known as a type I error). Smaller values of α are not advisable to study the data of a medium-large PV plant, because the higher complexity of the whole system requires a larger uncertainty to be accepted. The verification of the unimodality can also be done on the basis of a relationship between the values of skewness and kurtosis [23,24]; nevertheless, in this paper only the HDT will be used, because it is usually more sensitive than other methods.
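The skewness/kurtosis-based unimodality check mentioned above is commonly implemented through the bimodality coefficient. The sketch below assumes the usual SAS-style formulation with the uniform-distribution benchmark 5/9; it is a hedged alternative to the HDT actually used in the paper, not a reproduction of it.

```python
from scipy import stats

def bimodality_coefficient(x):
    """Sarle's bimodality coefficient BC = (g1^2 + 1) / (g2 + c_n),
    where g1 is the sample skewness, g2 the excess kurtosis, and
    c_n = 3 (n-1)^2 / ((n-2)(n-3)) a finite-sample correction.
    Values above 5/9 (the uniform-distribution benchmark) suggest
    bi- or multi-modality."""
    n = len(x)
    g1 = stats.skew(x)
    g2 = stats.kurtosis(x)   # Fisher definition: excess kurtosis
    c_n = 3.0 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (g1 ** 2 + 1.0) / (g2 + c_n)

# Two well-separated production levels -> clearly bimodal sample.
bimodal = [10.0] * 50 + [40.0] * 50
print(bimodality_coefficient(bimodal) > 5 / 9)   # True

# A single concentrated level stays below the 5/9 benchmark.
unimodal = [1.0, 2.0, 3.0, 4.0, 5.0]
print(bimodality_coefficient(unimodal) < 5 / 9)  # True
```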
If the unimodality check is not passed, the distribution is not Gaussian and ANOVA cannot be used; then a non-parametric test has to be applied. In the general case of N arrays, with N > 2, the non-parametric test has to be chosen between the Kruskal-Wallis test (K-W) [25,26] and Mood's Median test (MM), which do not require the distributions to be Gaussian, but only continuous. In the presence of outliers, MM performs better than K-W.
Instead, if the unimodality is satisfied, other checks are needed before deciding whether ANOVA can be applied. First of all, it is necessary to verify the previous assumptions (b) and (c). If both of them are satisfied, ANOVA can be applied. Otherwise, since ANOVA is effective also under modest violations of those assumptions, the extent of the violations has to be quantified. With this in mind, condition (b) on the variance can be verified by means of the Homoscedasticity Test (HT), which is again a hypothesis test that returns a p-value. So, also in this case, it is possible to fix the significance level α = 0.05 (accepting a 5% probability of a type I error) and to compare it with the p-value. If the inequality p-value < α is satisfied, the variances of the distributions of the arrays are different; thus, condition (b) is violated and ANOVA cannot be applied. In this case, it is necessary to use K-W or MM, depending on the presence or absence of outliers. Otherwise, ANOVA can be applied, provided that also condition (c) is satisfied, possibly with a modest violation. To check condition (c), an effective tool is the normal probability plot [
27], which returns information about the range of percentiles that fall within the normal distribution. If the feedback from the normal plot is not sufficient to decide whether condition (c) is satisfied, then the values of skewness (Equation (3)) and kurtosis (Equation (4)) of each distribution have to be calculated [19]. The skewness, s, is defined as:

s = E[(x − µ)³] / σ³ (3)
where µ is the mean of the data x, σ is the standard deviation of x, and E(t) represents the expected value of the quantity t. The skewness is the third standardized moment and measures the asymmetry of the data around the mean value. Only for s = 0 is the distribution symmetric; this is a necessary, but not sufficient, condition for a Gaussian distribution. In fact, while the Gaussian distribution is surely symmetric, there also exist symmetric but non-Gaussian distributions. Therefore, the value of the skewness alone is not exhaustive. For this reason, it is necessary to also calculate the kurtosis, κ, here defined as Pearson's kurtosis less 3 (also known as excess kurtosis) and calculated as:

κ = E[(x − µ)⁴] / σ⁴ − 3 (4)
The kurtosis is the fourth standardized moment and measures the tailedness of the distribution. Only for κ = 0 is the distribution mesokurtic, which is a necessary but not sufficient condition for a Gaussian distribution. Other details can be found in [19]. The pair of indices (skewness and kurtosis) allows quantifying the mismatch of a distribution with respect to a Gaussian one and evaluating whether the violation of condition (c) is acceptable or not, even if the maximum acceptable mismatches are not unique [
28,29,30]. Only if this verification is also passed can the ANOVA test be applied; otherwise, a non-parametric test has to be used.
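Equations (3) and (4) map directly onto SciPy: `skew()` computes the third standardized moment, and `kurtosis()` defaults to the excess (Fisher) definition, i.e., Pearson's kurtosis minus 3, as used in the text. The toy sample below is illustrative, not real plant data.

```python
from scipy import stats

# A small symmetric sample: deviations from the mean are mirrored.
x = [1.0, 2.0, 3.0, 4.0, 5.0]

s = stats.skew(x)          # Equation (3): 0.0 for a symmetric sample
kappa = stats.kurtosis(x)  # Equation (4): negative, flatter than a Gaussian

print(s, kappa)            # 0.0 -1.3
```

A platykurtic result such as this one (κ < 0) is exactly the kind of mismatch whose acceptability the methodology must judge before trusting ANOVA.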
As new data are acquired and the size of the energy dataset increases, the monitoring becomes more accurate, allowing the estimation and location of even a low-intensity anomaly, before it becomes a fault.
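The overall decision flow of Figure 1 can be condensed into a small dispatcher. This is a sketch under several explicitly assumed choices, not the paper's implementation: Levene's test stands in for the unspecified homoscedasticity test HT, the unimodality check is passed in as a user-supplied predicate (e.g., wrapping an HDT implementation), and the skewness/kurtosis thresholds for a tolerable violation of normality are illustrative, since the paper notes that the maximum acceptable mismatches are not unique.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # significance level used throughout the paper

def choose_and_run_test(groups, is_unimodal, has_outliers,
                        max_skew=0.5, max_kurt=1.0):
    """Sketch of the flow chart of Figure 1.  Assumed choices (not from the
    paper): Levene's test plays the role of HT; `is_unimodal` wraps an
    external unimodality test; `max_skew`/`max_kurt` are illustrative
    thresholds for a modest violation of normality."""
    def nonparametric():
        if has_outliers:                        # MM is more robust to outliers
            return "Mood's median", stats.median_test(*groups)[1]
        return "Kruskal-Wallis", stats.kruskal(*groups)[1]

    if not all(is_unimodal(g) for g in groups):
        return nonparametric()                  # multimodal -> surely non-normal
    if stats.levene(*groups).pvalue < ALPHA:    # condition (b) violated
        return nonparametric()
    normal_enough = all(abs(stats.skew(g)) <= max_skew and
                        abs(stats.kurtosis(g)) <= max_kurt for g in groups)
    if not normal_enough:                       # condition (c) violated too much
        return nonparametric()
    return "ANOVA", stats.f_oneway(*groups).pvalue

# Deterministic toy data: three arrays with identical spreads and almost
# identical daily productions, so every check passes and ANOVA is chosen.
base = np.array([36.0, 38.0, 39.0, 40.0, 40.0, 41.0, 42.0, 44.0])
groups = [base, base + 0.1, base - 0.1]
name, p = choose_and_run_test(groups, is_unimodal=lambda g: True,
                              has_outliers=False)
print(name, round(p, 3))        # ANOVA, with a large p-value
```

Failing any check (e.g., declaring a group multimodal via `is_unimodal`) routes the same data to K-W or MM instead, mirroring the blue blocks of Figure 1.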