Intelligent and Data-Driven Fault Detection of Photovoltaic Plants

Most photovoltaic (PV) plants conduct operation and maintenance (O&M) by periodical inspection and cleaning. Such O&M is costly and inefficient. It fails to detect system faults in time, thus causing heavy loss. To ensure their operations are at an ideal state, this work proposes an unsupervised method for intelligent performance evaluation and data-driven fault detection, which enables engineers to check PV panels in time and implement timely maintenance. It classifies monitoring data into three subsets: ideal period A, transition period S, and downturn period B. Based on A and B datasets, we build two non-continuous regression prediction models, which are based on a tree ensemble algorithm and then modified to fit the non-continuous characteristic of PV data. We compare real-time measured power with both upper and lower reference baselines derived from two predictive models. By calculating their threshold ranges, the proposed method achieves the instantaneous performance monitoring of PV power generation and provides failure identification and O&M suggestions to engineers. It has been assessed on a 6.95 MW PV plant. Its evaluation results indicate that it is able to accurately determine different functioning states and detect both direct and indirect faults in a PV system, thereby achieving intelligent data-driven maintenance.


Introduction
Nowadays, PV energy represents the third-largest source of renewable energy after wind and hydro [1,2]. Many countries are developing PV projects to utilize such renewable sources. For example, a massive solar park with 7.2 million PV panels has been built in Egypt to increase its generation capacity [3], and an Iowa farm [4] in the USA uses solar power to generate fuel and fertilizer on-site. In order to increase the efficiency of power generation, PV power plants have shifted their focus from large-scale development to large-scale operation and maintenance (O&M) [5,6]. Under this circumstance, intelligent O&M methods are being widely researched. By employing them, PV power stations can analyze their operation process automatically and cope with faulty situations in a timely manner, thus greatly improving the overall efficiency of maintenance and management.
Most intelligent O&M methods are based on either video/image analysis or a monitoring database. Video/image-based methods [6-8] use UAVs, satellites, or 24 h cameras to capture videos or images of PV stations and then train deep learning models to detect potential anomalies. However, obtaining a reliable and accurate model requires a large number of labelled samples (anomalies in the PV panels, e.g., short circuits and cell cracking). Yet, [...] acquisition.
Huang et al. [25] use the AdaBoost algorithm to establish a fault diagnostic model. Momeni et al. [26] use a graph-based semi-supervised learning (GBSSL) algorithm to identify, classify, locate, and correct faults. Ma et al. [27] focus on a partial shading scenario and apply multiple-output support vector regression (M-SVR) to estimate the shading strength. Chen et al. [28] propose a random forest (RF) based fault diagnosis model that takes the real-time operating voltage and string currents of the PV arrays as fault features. Compared to the above-mentioned PR, I-V and statistical methods, the prediction methods are data-driven: they learn the diagnosis knowledge from historical data and do not require domain expertise. Moreover, such machine learning-based models can detect faults in real time and classify their specific type with high prediction accuracy. However, these methods require data to be collected from both normal and faulty conditions. Our proposed method is an unsupervised prediction method, and we do not directly predict which fault occurs. Instead, we predict the expected ideal and worst power generation and then compare each of them with the real-time measured value. Hence, we are able to evaluate real-time performance and identify faults. Our work is novel and advances the area of intelligent O&M in the following aspects:
(1) Applying an unsupervised detection method. PV panels' performance depends on meteorological conditions, and a large number of faults may appear. It is difficult to obtain a dataset covering all possible fault scenarios. Thus, some methods must artificially produce labelled anomalous data by intentionally creating open circuits or short circuits in PV panels [29,30]. This reduces the total power generation of PV stations and lowers their operating efficiency. In contrast, our method is unsupervised: it does not rely on labelled faulty samples and simply makes use of the existing monitoring data to evaluate operating status and detect anomalies.
(2) Building non-continuous regression models. PV generation is non-continuous because of the current-limiting nature of photoelectric conversion [1,29,31]: data values are not continuous over the whole range, and some ranges are meaningless, so no data values are found there in practice. Considering this special characteristic, we build non-continuous regression models. First, we deploy an ensemble tree-based regression algorithm that adjusts a tree structure according to the data characteristics and thus handles non-linear functions better [32]. Moreover, we apply a clustering-based modification to the regression predictions to ensure that no nonexistent value is produced in such non-continuous regression tasks. To the best of our knowledge, our method is the first to deal with such non-continuous issues in PV predictive models.
(3) Detecting indirect faults sensitively. Unlike direct faults that lead to conspicuous performance loss and can be identified easily, indirect faults (caused by dust, module degradation and so on) result in such a gradual PV generation loss that many methods fail to detect it [33,34]. Instead, by comparing the real-time measured value with both the upper and lower references, our method can accurately distinguish different states of PV panels, and hence detect indirect faults sensitively and also provide instantaneous alarm of degradation.

Proposed Framework
The main objective of the proposed O&M framework is to enable PV system production to reach its expected level of efficiency intelligently [19]. Therefore, the proposed approach aims at PV system failure detection, performance evaluation, and O&M planning. The notations frequently used in this paper and their descriptions are summarized in Table 1. The concrete steps of the proposed method are detailed in the following sections.
Table 1. Notations and their descriptions.
  Notation                                          Description
  $\mathrm{Max}_i$, $i = 1, 2, \ldots, k$           The maximum value of cluster $C_i$
  $\mathrm{Min}_i$, $i = 1, 2, \ldots, k$           The minimum value of cluster $C_i$
  $l_i \in \{1, 2, 3, \ldots, k\}$, $i = 1, 2, \ldots, m$   Class label of the i-th sample in $D$
  $f_A$, $f_B$                                      Upper and lower regression models
  $p_A$, $p_B$                                      Predictions from the upper and lower model
  $\hat{p}_A$, $\hat{p}_B$                          Modified upper and lower predictions, also simplified as $a$ and $b$
  $\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$      Coefficients that divide up a baseline range
  $w$                                               Weather scale factor

A general framework of the proposed method is presented in Figure 1. First, we apply data preprocessing (including outlier detection, feature analysis and data pre-classification) to both historical meteorological and PV power datasets. Then, historical data are pre-classified into three subsets that represent different operational statuses, namely, ideal period dataset $D_A$, transition period dataset $D_S$ and downturn period dataset $D_B$. We apply an XGBoost-based regression algorithm to datasets $D_A$ and $D_B$ to train upper and lower baseline models of a PV plant's power output. Moreover, since PV power data are non-continuous, we deploy k-means [35] to cluster the hierarchical PV power data and use the statistical results of every cluster to modify the prediction values. Furthermore, because PV generation is very low in bad weather (e.g., rainstorm, blizzard, hail, and sandstorm), we use the corresponding weather scale factors to revise both references. Thus, by integrating the results of the upper and lower reference models, the clustering results and the weather scale factors, we acquire the final upper and lower reference curves. By comparing the measured power with the two reference curves, we can evaluate performance, detect faults, and carry out intelligent data-driven O&M. Note that our method does not use information related to a PV system's components, such as inverters, which means that it is a generic data-based method and is not limited to a certain type of PV system.
We intend to evaluate the different operating statuses of real-time power generation so as to better implement O&M. Thus, we propose five stages (determined by two references) and present the corresponding definitions, which are detailed in Section 3.3. A simple threshold that measures how close the actual power value is to the expected value relies on a single reference indicating the expected generation, so it is easy and clear to identify whether the PV panels are in the expected state or not. However, a single threshold cannot answer the following important questions: which operating situation (ideal or malfunction period) the PV panels are in when the actual generation values are higher than the expected ones, and how to distinguish the downturn period, when panels are covered by light barriers, from malfunction periods, when they suffer from short circuits (in both cases the power output is far below the expected reference value). Using only one threshold makes fine-grained performance evaluation difficult. Therefore, we propose to use two references indicating the expected best and worst generation. With two references, the above questions can be easily answered, and the operating status and real-time generation efficiency can be determined more accurately.

Data Preprocessing
Before a prediction model is applied, the first step is to conduct data preprocessing, including outlier detection, feature analysis, and data pre-classification.

Outlier Detection
Due to the error or failure of sensor data transmission, there are various anomalies in raw PV monitoring data, such as missing, negative, and duplicated values. Apart from conducting basic preprocessing towards such obvious outliers, we pay attention to detecting others, e.g., extreme and unmatched values in the original dataset so as to thoroughly clean the data.
First, we apply classical statistical methods, i.e., the box-plot and the 3σ criterion, to each single feature to detect outliers that deviate far from most of the data. Note that such statistical methods achieve only local detection: they identify extreme values in a single feature. For global detection, we concentrate on unmatched values. For example, in one PV plant, when irradiation exceeds 1000 W/m², the corresponding PV generation should also be quite large, e.g., 1000 kWh. However, there is a record with 1000 W/m² irradiation but very low generated power, e.g., 20 kWh. In the proposed framework, such unmatched structural outliers are removed by an unsupervised machine learning algorithm, DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [36], which is able to identify hidden outliers from a global perspective. DBSCAN has two major parameters: the radius ε determines the scope of a cluster, and the minimum number of points N is the minimum number of members in a cluster. It can be regarded as a simple binary classification method (normal data vs. outliers). Although it is an excellent anomaly detection method, it cannot be used directly in a monitoring system to classify different operating statuses and different faults. By design, it is sensitive to the data distribution and depends heavily on manual off-line parameter setting, which makes it unsuitable for online detection. In particular, when data are recorded every minute, DBSCAN gradually becomes unstable and inaccurate without timely human supervision. The main motivation of intelligent PV fault detection is to identify faults instantly and warn engineers of anomalous situations in time, so that they do not have to keep watching the monitoring data yet still notice anomalies as soon as they occur. This goal is hard to achieve with DBSCAN alone, so we propose the methods described next.
After detecting and removing outliers in raw data, we obtain our dataset D.
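The global detection step can be illustrated with a minimal sketch. It is a plain-Python rendering of the standard DBSCAN procedure, not the implementation used in this work; the function names, the sample records, and the values of `eps` and `min_pts` are assumptions chosen only for illustration. Points labelled as noise are treated as unmatched outliers and dropped.

```python
import math

def dbscan_labels(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n              # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:    # j is a core point, so keep expanding
                queue.extend(reach)
    return labels

def remove_unmatched(points, eps, min_pts):
    """Keep only the points that DBSCAN does not label as noise."""
    labels = dbscan_labels(points, eps, min_pts)
    return [p for p, lab in zip(points, labels) if lab != -1]

# Hypothetical (irradiance, power) records: power follows irradiance,
# plus one unmatched record with high irradiance but very low power.
records = [(50.0 * i, 50.0 * i) for i in range(21)] + [(1000.0, 20.0)]
cleaned = remove_unmatched(records, eps=80.0, min_pts=3)
```

In this toy dataset the record (1000, 20) has no neighbours within `eps` and is removed, while the records along the main trend are kept. With real monitoring data the features would usually need comparable scales before a single radius is meaningful.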

Feature Analysis
Since we aim to predict PV generation, it is necessary to analyze PV power data (p) in detail. First, it is greatly affected by the fluctuation and uncertainty of meteorological factors, and hence exhibits variability and volatility. In particular, under the non-stationary and low illumination intensity of cloudy and rainy days, PV power data are prone to violent fluctuation [2,26,31,37]. Second, due to the current-limiting nature, the power output of a PV system is non-continuous [2,29,37]. However, traditional methods, e.g., linear regression and support vector regression, fit continuous data and output continuous values. These special characteristics of PV power data make accurate prediction of PV output more difficult. We deal with such non-continuous regression tasks by a clustering-based modification, which is presented in detail in the next sections.
Besides analyzing PV power data, we pay attention to the factors that affect or contribute to PV output. The most directly related factors are meteorological ones. Commonly used factors include solar irradiance (r), temperature of PV panel (τ), relative humidity (h), wind velocity (v), and wind direction (d). In order to capture the non-linear relationship between meteorological factors and PV power, as many feature engineering methods do, we construct two additional features, $r_L$ and $h_L$, which are the logarithmic values of r and h. To capture the changing trend (increasing or decreasing) of solar irradiance, we add the feature $\Delta r$, which denotes the difference in irradiance between two adjacent r values.
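The feature construction described above can be sketched as follows. This is an illustrative sketch only: the dictionary keys, the function name `build_features`, the small offset `eps` that guards against the logarithm of zero, and the choice of 0 for the first differential value are assumptions that the text does not specify.

```python
import math

def build_features(records, eps=1e-6):
    """Add r_L, h_L and delta_r to time-ordered records.

    Each record is a dict with at least the keys "r" (solar irradiance)
    and "h" (relative humidity). eps is an assumed small offset so that
    the logarithm is defined when a value is zero.
    """
    enriched = []
    prev_r = None
    for rec in records:
        feats = dict(rec)                        # keep the original features
        feats["r_L"] = math.log(rec["r"] + eps)  # logarithmic irradiance
        feats["h_L"] = math.log(rec["h"] + eps)  # logarithmic humidity
        # Difference between two adjacent irradiance values; the first
        # record has no predecessor, so 0.0 is used here (assumption).
        feats["delta_r"] = 0.0 if prev_r is None else rec["r"] - prev_r
        prev_r = rec["r"]
        enriched.append(feats)
    return enriched

# Hypothetical readings at two consecutive sampling times.
sample = [{"r": 100.0, "h": 50.0}, {"r": 150.0, "h": 40.0}]
features = build_features(sample)
```

A positive `delta_r` indicates rising irradiance and a negative one indicates falling irradiance, which is the trend information the additional feature is meant to carry.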

Data Pre-Classification
A critical part of the proposed method is to build two models: the upper and lower reference models. It is therefore important to select suitable valid data from the original data to train them. We examine the original data $D$ and manually select the ideal period dataset $D_A$ and the downturn period dataset $D_B$ for the upper and lower model, respectively.
As shown in Figure 2, we propose to pre-classify the original data into three periods. For most PV plants, even if no faults occur during operation, the PV panels degrade over time due to dust or module deterioration. Thus, the power generation presents a declining tendency, as shown in Figure 2. The states of PV panels are divided into the following three periods:
(1) Ideal period A: When the panels are first brought into operation, or after maintenance (e.g., cleaning and washing), the PV panels are in a healthy and clean state without any light barriers. At this time, the efficiency of photoelectric conversion is comparatively high. The power generation of the PV plant is also in an ideal state, namely, relatively high and stable.
(2) Transition period S: Under a natural state and without any interference, PV panels gradually accumulate dust and some light barriers (e.g., bird droppings, leaves, snow and plastic bags). Under this circumstance, the conversion efficiency declines, and power generation gradually declines too. The total PV power generation makes a gradual transition from the ideal state to a lower state.
(3) Downturn period B: When there is visible dust or too many light barriers on PV panels, so that PV cells receive little solar irradiation, or when the panels are aging, the photoelectric conversion efficiency reaches its lowest limit, and the generated power remains low.
Among these operating periods, we pay special attention to periods A and B. We collect data from these two periods to construct the ideal period dataset $D_A$ and the downturn period dataset $D_B$. Note that in this paper we manually classify the dataset $D$ and select suitable data for $D_A$ and $D_B$. According to the definitions of the operating periods and our preliminary investigation, we are able to select data for $D_A$ and $D_B$ based on weather and maintenance records. Based on experience, these two factors are the ones most closely related to generation efficiency. Besides, a plant's own historical operating records and online weather data are convenient to access, so our method can be easily applied to other similar tasks. In the future, we will consider labeling the historical data $D$ with period labels and training a classification model, thus avoiding manual selection of data.

Non-Continuous Regression Models
The proposed method builds the upper and lower baseline models with $D_A$ and $D_B$, respectively. The training procedures of these two models are similar, and the only difference lies in the PV dataset used for training.
As mentioned above, PV power data have the special characteristics of variability and non-continuity, which motivates us to deploy an ensemble tree-based regression method called extreme gradient boosting (XGBoost) [38]. It assembles a number of CARTs (Classification and Regression Trees) as base learners, which can deliver more accurate predictions. It inherits the advantages of a decision tree algorithm and handles non-continuous functions well, which suits the prediction task on non-continuous PV power data. Hence, we deploy it as our regression algorithm. The power generation p is the output of our prediction model, whose inputs are the combination of meteorological features (r, τ, h, v, d), time-related features (T), and additional features ($\Delta r$, $r_L$, $h_L$). Then, our regression prediction model is denoted as:
$$p = f(r, \tau, h, v, d, T, \Delta r, r_L, h_L) \tag{1}$$
where $f$ is an XGBoost-based prediction function. Note that we cannot obtain explicit expressions in a tree-based regression method. Hence, $f$ is a simplified notation for a tree structure and its corresponding parameters. Using $D_A$ and $D_B$ as training datasets, we obtain two prediction models, $f_A$ and $f_B$. By inputting a real-time feature vector
$$x = [r, \tau, h, v, d, T, \Delta r, r_L, h_L] \tag{2}$$
into $f_A$ and $f_B$, we can conduct PV generation prediction and acquire the upper and lower references, i.e.,
$$p_A = f_A(x) \tag{3}$$
$$p_B = f_B(x) \tag{4}$$
Although the proposed XGBoost-based regression model is suitable for fitting non-continuous PV data, it is still a regression method and sometimes produces outputs that do not exist in a real PV system. Considering the non-continuous characteristic of PV generation, it is necessary to further refine (3) and (4) with the weather scale factors and the power data clustering results, as detailed below. Due to the above-mentioned current-limiting principle, PV power data exhibit obvious hierarchical discreteness. Power values belong to several particular groups within which they are continuous. Between two adjacent groups, there is a blank gap with no data.
We propose to cluster the original power data by using k-means algorithm [35]. In k-means, there is only one key parameter: the number of clusters denoted as k. After clustering, we calculate the minimum and maximum values for each cluster. Hence, we can know the ranges to which actual values belong. For upper or lower predictions located outside existing ranges, we propose to modify them with the maximum or minimum values of their closest cluster.
For the upper prediction value $p_A$, if it does not belong to any cluster, then the principle of proximity is adopted to correct it. We replace it with the maximum value of the closest cluster. The modified prediction value is
$$\hat{p}_A = \mathrm{Max}_j \tag{5}$$
where $\mathrm{Max}_j$ is the maximum value of the j-th cluster, i.e., the cluster closest to $p_A$. Similarly, for $p_B$ located beyond any existing cluster, we modify it with the closest minimum value, as follows:
$$\hat{p}_B = \mathrm{Min}_j \tag{6}$$
where $\mathrm{Min}_j$ is the minimum value of the j-th cluster.
Considering the variability of PV generation under different weather conditions, we propose weather scale factors to make our prediction more robust. When predicting expected PV generation in bad weather (here, bad weather refers to greatly unstable or extremely low irradiance, e.g., rainstorms, blizzards, hail, and sandstorms), we multiply the predictions by a weather scale factor. In the proposed method, a weather scale factor $w$ represents the reduction of power generation in bad weather. It can be computed from the historical monitoring data as the ratio of the average power output on a bad-weather day to that on a normal day, so that $w < 1$ and multiplying by $w$ lowers both references. Then, (5) and (6) are modified by $\hat{p}_A \leftarrow w\hat{p}_A$ and $\hat{p}_B \leftarrow w\hat{p}_B$. Prediction modification is realized in Algorithm 1, whose clustering part follows the k-means iteration: each sample is classified into the closest category; each cluster centroid $\mu_j$ is moved to the mean of the points assigned to it; these two steps are repeated until the change of the centroids is less than a certain threshold; and the clustered data are then obtained.
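The clustering-based modification and the weather scaling can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a reproduction of Algorithm 1: the quantile-based centroid initialisation, the definition of the "closest" cluster as the one whose range boundary is nearest to the prediction, the convention that $w$ is below 1 in bad weather, and all function names and sample values are assumptions.

```python
def cluster_ranges(values, k, tol=1e-6, max_iter=100):
    """1-D k-means; returns the sorted (Min_j, Max_j) range of each cluster."""
    ordered = sorted(values)
    # Assumed deterministic initialisation: spread centroids over the data.
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            # Classify each sample into the closest category.
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[j].append(v)
        # Move each centroid to the mean of the points assigned to it.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        shift = max(abs(u - c) for u, c in zip(updated, centroids))
        centroids = updated
        if shift < tol:              # stop when the centroids barely move
            break
    return sorted((min(c), max(c)) for c in clusters if c)

def modify_prediction(pred, ranges, upper, w=1.0):
    """Snap a prediction outside every cluster range, then apply w."""
    inside = any(lo <= pred <= hi for lo, hi in ranges)
    if not inside:
        # Closest cluster = nearest range boundary (assumption).
        lo, hi = min(ranges,
                     key=lambda r: min(abs(pred - r[0]), abs(pred - r[1])))
        pred = hi if upper else lo   # Max_j for upper, Min_j for lower
    return w * pred

# Hypothetical power values forming three separated groups.
history = [10, 11, 12, 50, 51, 52, 90, 91, 92]
ranges = cluster_ranges(history, k=3)
upper_ref = modify_prediction(60, ranges, upper=True)          # snapped to 52
lower_ref = modify_prediction(60, ranges, upper=False)         # snapped to 50
bad_weather_ref = modify_prediction(60, ranges, upper=True, w=0.5)
```

A prediction that already lies inside a cluster range is left unchanged apart from the weather factor, so the modification only affects values that fall into a gap where no measured power exists.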

Performance Evaluation, Fault Detection, and O&M Planning
Generally, a PV system can be affected by different types of faults that result in the significant loss of power [20,39,40]. According to the factors causing PV faults, two types of faults can be distinguished: direct and indirect faults. Some direct faults such as cell cracking, nonconnected module, open circuit and short circuit in a PV system, and broken fuse or cable, cause conspicuous performance loss. Indirect factors, such as shading due to dust or light barriers, encapsulation degradation due to ultraviolet and yellowing EVA (ethylene vinyl acetate), module degradation due to light or heat, and rust due to water infiltration, lead to the gradual deterioration of PV panels, and hence result in the gradual power loss [34]. Using the monitored data, a PV monitoring system has to decide whether there is degradation in its generation performance [41].
In the proposed approach, apart from the real-time PV power-versus-time curve displayed in the monitoring system, two reference curves (A and B) from the regression models are shown in the same figure. For each real-time record (a feature vector $x$ and its corresponding power generation $p$), we obtain the expected ideal and worst PV generation, $\hat{p}_A$ and $\hat{p}_B$, by inputting $x$ into Algorithm 1. For simplicity, we set $a = \hat{p}_A$ and $b = \hat{p}_B$. Our method evaluates the PV panels' real-time status by comparing the real-time PV power $p$ with $a$ and $b$. As in Table 2, five stages are distinguished.

Table 2. Operating stages determined by the two references.
  Stage 1   Far above baseline A         $p > (1 + \alpha_1)a$
  Stage 2   Fluctuating near A           $(1 - \alpha_2)a < p < (1 + \alpha_2)a$
  Stage 3   Between A and B              $(1 + \beta_2)b \le p \le (1 - \alpha_2)a$
  Stage 4   Fluctuating near B           $(1 - \beta_2)b < p < (1 + \beta_2)b$
  Stage 5   Far below baseline B         $p < (1 - \beta_1)b$

If real-time power exhibits more than a given percentage of generation loss, the record is classified into a downturn or malfunction period. In our method, we set $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$ ($\alpha_1 > \alpha_2$, $\beta_1 > \beta_2$ and $\alpha_1, \alpha_2, \beta_1, \beta_2 \in [0, 1]$) to divide up the warning ranges. Users receive an alarm if a PV panel produces more than a fraction $\alpha_1$ above the expected ideal baseline A, i.e., $p > (1 + \alpha_1)a$, or more than a fraction $\beta_1$ below the expected worst baseline B, i.e., $p < (1 - \beta_1)b$, which means that the sensors or PV panels may have broken down or the data transmission is incorrect. If $p$ is fluctuating near B, i.e., $(1 - \beta_2)b < p < (1 + \beta_2)b$, the PV panels are of low efficiency and need timely maintenance, such as manual cleaning and equipment repair. For $\alpha_1$ and $\beta_1$, smaller values mean stricter alerts and more sensitive detection; larger values mean looser limits but help reduce false alarms. In contrast, a larger $\beta_2$ means a larger range for Stage 4, which may result in more observations being classified into the downturn period and hence more false alarms.
If $p$ is near A, i.e., $(1 - \alpha_2)a < p < (1 + \alpha_2)a$, the PV power generation is running in an ideal state, and there is no need to implement any maintenance. Furthermore, there is no warning or alarm when $p$ is between A and B, i.e., $(1 + \beta_2)b \le p \le (1 - \alpha_2)a$: the panels are in the transition period, which we consider part of the normal life cycle of PV panels. Note that if $\alpha_2$ is too large, more data are classified into the ideal period, leading to the risk of misclassifying potential faults.
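The threshold rules can be collected into one small function. The sketch below is illustrative only: the function name, the stage labels, and the sample values of $a$, $b$ and the four coefficients are assumptions. The text does not assign the bands between the alarm limits and the near-baseline ranges to any stage, so the sketch reports them as "unassigned" rather than guessing.

```python
def classify_stage(p, a, b, alpha1, alpha2, beta1, beta2):
    """Map real-time power p to an operating stage using references a and b.

    Assumes b < a and that the near-A and near-B ranges do not overlap.
    """
    if p > (1 + alpha1) * a:
        return "alarm: far above A"      # possible sensor or transmission error
    if p < (1 - beta1) * b:
        return "alarm: far below B"      # possible malfunction
    if (1 - alpha2) * a < p < (1 + alpha2) * a:
        return "ideal"                   # near A, no maintenance needed
    if (1 - beta2) * b < p < (1 + beta2) * b:
        return "downturn"                # near B, maintenance suggested
    if (1 + beta2) * b <= p <= (1 - alpha2) * a:
        return "transition"              # between A and B, no warning
    return "unassigned"                  # band not covered by the rules above

# Hypothetical references and coefficients (alpha1 > alpha2, beta1 > beta2).
params = dict(a=100.0, b=60.0, alpha1=0.2, alpha2=0.05, beta1=0.3, beta2=0.1)
stages = [classify_stage(p, **params) for p in (130.0, 100.0, 80.0, 60.0, 30.0)]
```

With these sample values the five inputs fall into the two alarm stages, the ideal period, the transition period and the downturn period, which mirrors how a monitoring system would turn one measured value into an O&M suggestion.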

Data Description
We have conducted experiments based on the proposed method and applied it to a 6.95 MW PV plant. Apart from zero-records (at night or missing), the available effective monitoring data consist of 5936 records (in our experiments, the monitoring system records sensor data every 15 min). Although the collected dataset covers less than one year, it includes all kinds of weather situations, especially some extreme weather, e.g., snow, high temperature and rainstorms. Examples of the collected data are shown in the Supplementary File. As mentioned in Section 3, we collect power data $p$ and features $x = [r, \tau, h, v, d, T, \Delta r, r_L, h_L]$ in (2).

Results of Data Preprocessing
We use DBSCAN [36] to detect outliers, with the parameters set as ε = 46 and N = 25. To explore the raw data intuitively, we plot PV data of a week in Figure 3. Only daytime hours (from 6:00 to 18:00) are considered in our PV forecasting application. As in Figure 3, PV power experiences violent fluctuation within a day as well as drastic variation among days. Meanwhile, we plot all measured PV power data versus time, as shown in Figure 4. It is noticeable that PV power data are non-continuous. Looking at the figure from bottom to top, there are many blank gaps between adjacent clusters of data points, so the power data clearly belong to different groups. In sum, the characteristics of the raw data, namely intensive fluctuation and variability together with hierarchical non-continuity, motivate us to apply an XGBoost algorithm, which is well suited to the non-continuous PV power regression task.
We aim to select appropriate data for two datasets, D A and D B, reflecting the ideal and downturn periods, respectively. First, we obtain D A by considering the maintenance records of the studied PV plant. The scheduled maintenance plan for 2018 is to clean the panels every month, with each round lasting nearly half a month (from the 16th to the end of that month). The cleaning work is therefore quite frequent, and the PV panels always stay in a comparatively clean state, namely, the ideal period. Because the downturn-period data require uncleaned panels, the monthly cleaning is stopped from May. Moreover, we consider the PV panels to be perfectly clean two days after cleaning, so we take data from the 18th to the end of the month, from January to April, as the ideal-period dataset. For example, in January, data are collected into D A. To obtain valid D B, we suspend the monthly cleaning from May. Without cleaning, PV panels depend on rain to wash away dust or other light barriers. It is therefore important to find data from runs of consecutive rain-free days, when the PV panels may be covered with dust or light barriers and are hence in the downturn period.
By checking the historical weather, which is freely available online, we are able to select suitable valid data for D B from May to July. Taking July as an example, the first week is rainy, so those data are not appropriate for D B. From 8 July 2018 onward, however, the weather is cloudy, overcast, or sunny, which suits our selection principle for D B. We therefore add the data from 9 July 2018 to 15 July 2018 to D B.
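The two selection rules described above can be sketched as simple date filters. This is an illustration under stated assumptions: `in_ideal_period` and `dry_runs` are hypothetical helper names, the weather labels are toy strings, and the minimum run length is an arbitrary choice. The paper takes 9 to 15 July from the dry spell starting on 8 July; how the start day is chosen is not stated, so the sketch simply returns the whole spell.

```python
from datetime import date, timedelta

def in_ideal_period(day):
    """True for days from the 18th to month end, January to April 2018."""
    return day.year == 2018 and 1 <= day.month <= 4 and day.day >= 18

def dry_runs(weather_by_day, min_days):
    """Return (start, end) pairs of consecutive rain-free days of length >= min_days."""
    runs, current = [], []
    for day in sorted(weather_by_day):
        is_dry = weather_by_day[day] != "rain"
        extends = bool(current) and day - current[-1] == timedelta(days=1)
        if is_dry and (extends or not current):
            current.append(day)
            continue
        # The run is broken by rain or by a gap in the calendar.
        if len(current) >= min_days:
            runs.append((current[0], current[-1]))
        current = [day] if is_dry else []
    if len(current) >= min_days:
        runs.append((current[0], current[-1]))
    return runs

# Toy July 2018 weather: rain in the first week, then no rain until the 15th.
july = {date(2018, 7, d): ("rain" if d <= 7 else "sunny") for d in range(1, 16)}
july_runs = dry_runs(july, min_days=5)
```

Candidate records for D B would then be those whose timestamps fall inside one of the returned spells.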
To conclude, through data preprocessing, we choose the XGBoost-based regression algorithm according to the special characteristics of PV power data, and we present how to construct datasets D A and D B. More detailed analyses of data preprocessing can be found in the Supplementary File.

Results of Non-Continuous Regression Models
We compare XGBoost with seven general-purpose regression methods, each of which has achieved good performance in existing research. Multivariable linear regression (MLR) models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to observed data. ElasticNet [42] is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods. Support vector regression [27] is the regression version of the support vector machine (SVM). Here, we apply kernel-based SVM: support vector regression with a linear kernel is denoted as SVR, and the one with a radial basis function kernel is denoted as SVR-RBF. Decision tree regression (DTR) [43] uses a tree structure to predict a continuous output from an input described by a set of properties. Random forest regression (RFR) [28] is an ensemble learning method that constructs a multitude of decision trees at training time and outputs the mean prediction of the individual trees. Gradient boosting decision tree (GBDT) [44] builds the ensemble model in a stage-wise fashion, as other boosting methods do, and generalizes them by allowing optimization of an arbitrary differentiable loss function. Note that we compare their basic regression performance and do not implement the proposed modifications (the clustering-based one and the weather scale factor-based one). The compared algorithms are:

(1) Multivariable linear regression (MLR)
(2) ElasticNet
(3) Support vector regression with linear kernel (SVR)
(4) Support vector regression with radial basis function kernel (SVR-RBF)
(5) Decision tree regression (DTR)
(6) Random forest regression (RFR)
(7) Gradient boosting decision tree (GBDT)

The above algorithms are available in scikit-learn [45], a free machine learning library for the Python programming language. Among them, RFR and GBDT are similar to XGBoost: they are tree-based ensemble methods, whereas the others use a single model.
To validate the generalization performance, we extract four datasets from the dataset D: April, May, June, and July. They are under different meteorological conditions. Then, for each dataset, we split 50% for training regression models and the rest for testing. We apply five-fold cross validation to search the optimal parameters that show the highest accuracy.
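The evaluation protocol above (a 50/50 split, then five-fold cross validation on the training half) can be sketched with index bookkeeping alone. The helper names `split_half` and `k_fold` are hypothetical, and the random seed and interleaved fold assignment are assumptions, since the paper does not state how the split and the folds are drawn.

```python
import random

def split_half(n_samples, seed=0):
    """Shuffle sample indices and return (train, test) halves."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    half = n_samples // 2
    return indices[:half], indices[half:]

def k_fold(indices, k=5):
    """Yield (fit, validate) index lists; each fold validates on a disjoint slice."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, folds[i]

train_idx, test_idx = split_half(100)
folds = list(k_fold(train_idx, k=5))
```

For each candidate parameter setting, a model would be fitted on `fit` and scored on `validate` in every fold; the setting with the best mean score is then refitted on the whole training half and evaluated once on `test_idx`.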
We use three performance metrics on the test data: the ratio of the root mean squared error to the mean value (nRMSE), the mean absolute error (MAE), and the goodness of fit R²:

$$\mathrm{nRMSE} = \frac{1}{\bar{y}} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$

where y_i is the ground truth of p, ŷ_i is the predicted value, ȳ is the mean of all y_i, and n is the number of test samples. All experiments are carried out in Python 3.7 with an Intel Core i5-8250 CPU and 8 GB of memory. Table 3 shows the performance results on each dataset, with the best one highlighted in bold. As Table 3 shows, XGBoost generally achieves much better performance than the other methods. Even on the May dataset, where it does not rank best, it is second-best with a small gap to the best method, RFR. Table 4 presents the average evaluation metrics over the four datasets; XGBoost outperforms its peers with the best average metrics. In particular, XGBoost performs better than RFR and GBDT, both of which are state-of-the-art ensemble regression methods. In sum, compared with its seven peers, XGBoost achieves higher average accuracy and more generalized performance under different meteorological conditions.

Table 3. Accuracy of algorithms (columns: Algorithms; April; May; June; July).
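The three test metrics described above (root mean squared error divided by the mean, mean absolute error, and goodness of fit) can be computed as in the following sketch. The function name `regression_metrics`, the dictionary keys, and the toy values are assumptions for illustration; R² is taken in its usual form, one minus the ratio of residual to total sum of squares.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RMSE divided by the mean of y_true, MAE, and R^2."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    sq_err = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    abs_err = sum(abs(y - p) for y, p in zip(y_true, y_pred))
    total_sq = sum((y - mean_y) ** 2 for y in y_true)
    return {
        "rmse_over_mean": math.sqrt(sq_err / n) / mean_y,
        "mae": abs_err / n,
        "r2": 1.0 - sq_err / total_sq,
    }

# Toy measured and predicted power values.
measured = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
scores = regression_metrics(measured, predicted)
```

Dividing the RMSE by the mean makes the error comparable across months with different average output, which matters when the four monthly datasets are compared side by side.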

Then, we apply XGBoost to train the upper and lower reference models. We randomly select 50% of D A and D B to train the regression models, and the rest of D A and D B are used for testing. The optimal XGBoost parameters for the upper model are a maximum depth of 3, a learning rate of 0.1, and 125 estimators. For the lower model, the maximum depth is 3, the learning rate is 0.06, and the number of estimators is 300. For both models, all other parameters are left at their default values.

As discussed in Section 3, we deploy a k-means clustering algorithm to classify the original PV power data and then use the clustering results to modify the prediction values. We set k = 16 so that the data are classified into exactly 16 classes (as Figure 4 shows, there are 16 classes). The clustering results are presented in the Supplementary File. We calculate the maximum value (Max) and the minimum value (Min) of every cluster.

Then, it is necessary to take weather scale factors into consideration to avoid incorrect near-zero PV predictions. Near-zero values are usually regarded as data from a malfunction period, but under extreme weather a zero PV output is reasonable, and some methods may produce false alarms in this situation. The weather scale factors modify the predictions under greatly unstable or extremely low irradiance, thus effectively avoiding such false alarms. They are computed as the ratio of the average power output of a normal day to that of a bad-weather day. The studied PV plant operates under normal weather conditions, i.e., the collected datasets do not include data under extreme weather conditions (e.g., blizzards, hail, and sandstorms), and we thus set w = 1. In future work, we plan to find more datasets that cover all types of weather conditions and to analyze how weather scale factors can be used to improve model performance.

Following Algorithm 1, we obtain the final predictions for the test samples. Table 5 presents the model performance on the training and test datasets. Both error metrics are low on the two test datasets, and R² is very close to 1 for both models, indicating that our two prediction models are highly accurate and reliable. Note that the performance in Table 5 is not as good as that in Tables 3 and 4. This is because our selected datasets D A and D B consist of records from different months, and hence the training and test datasets are less stable than those in the above experiments.

To better present the advantage of the proposed modifications, i.e., the clustering-based modification and the weather scale factor-based one shown in Algorithm 1, Figure 5a,b show an example of the monitoring system before and after modification, respectively. The two trained models are applied to the monitoring data of a day in May, which are presented as black hollow circles. The blue curve is the upper reference from f A, and the purple one is the lower reference derived from f B. In Figure 5a, the purple curve is sometimes above the blue one, e.g., during 5:28-6:43, 8:28-9:28, and 16:28-17:58, and many observations lie below or above the reference lines. In Figure 5b, by contrast, there is no overlap, and the purple reference line stays below the blue one. Furthermore, the modified references are more consistent with the changing trend of the actual PV power values, and fewer observations are located outside the two references.
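Algorithm 1 is not reproduced in this section, so the following is only one plausible reading of the clustering-based modification: a prediction that falls in a gap between clusters is moved to the nearest cluster boundary (Min or Max). The function names, the snapping rule, and the toy data are all assumptions and should not be taken as the authors' exact procedure. The weather scale factor is computed as described in the text; how it is applied to the predictions is not stated here, so the sketch only computes it.

```python
def cluster_ranges(power_values, labels):
    """Map each cluster label to the (Min, Max) of the power values assigned to it."""
    ranges = {}
    for value, label in zip(power_values, labels):
        low, high = ranges.get(label, (value, value))
        ranges[label] = (min(low, value), max(high, value))
    return ranges

def snap_to_clusters(prediction, ranges):
    """Keep a prediction inside any cluster range; otherwise move it to the
    nearest cluster boundary (hypothetical rule, not Algorithm 1 itself)."""
    best = None
    for low, high in ranges.values():
        if low <= prediction <= high:
            return prediction
        for bound in (low, high):
            if best is None or abs(bound - prediction) < abs(best - prediction):
                best = bound
    return best

def weather_scale_factor(normal_day_power, bad_day_power):
    """Ratio of average output on a normal day to that on a bad-weather day."""
    normal_mean = sum(normal_day_power) / len(normal_day_power)
    bad_mean = sum(bad_day_power) / len(bad_day_power)
    return normal_mean / bad_mean

# Toy data with two power levels separated by a gap (labels as if from k-means).
power = [1.0, 1.2, 1.1, 3.0, 3.3, 3.1]
labels = [0, 0, 0, 1, 1, 1]
ranges = cluster_ranges(power, labels)
```

With these toy ranges, a raw prediction of 1.5 lies in the gap and is moved to 1.2, while a prediction of 1.1 is left unchanged.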

Results of Performance Evaluation, Fault Detection, and O&M Planning
Based on the two modified prediction models, we obtain the corresponding upper and lower references of power generation. We compare them with the measured power value and assess the PV panel status according to where the measurements fall in the reference range. After testing different parameter settings, we recommend α1 = β1 = 0.5 and α2 = β2 = 0.1, which show satisfactory results in most experiments. To make a comprehensive comparison, we show results under different weather conditions, i.e., a sunny day in Figure 6a and a cloudy day in Figure 6b. There are five kinds of typical distributions.

As discussed before, data in Stages 1, 4, and 5 provide early warning for the engineers in a PV station. In Stages 1 and 5, actual values deviate far from the references. This is the malfunction period, in which sensors may break down and data transmission may be incorrect. Especially in Stage 5, where the power is relatively low, there is a high chance of open circuits or short circuits in the PV panels. In Stage 4, the generated power is comparatively low: the PV panels are in a downturn period and may be covered by dust and need cleaning. It is worth noting that some disturbances in the power grid also decrease the output power, which may then fall below the lower reference curve even though there is no failure in the PV plant. After such a warning, on-site engineers need to conduct further inspection and make corresponding O&M plans. Stages 2 and 3 do not trigger a warning, because they correspond to the ideal period and the transition period, respectively.

According to our previous data analysis and the definitions of the stages, the transition period (Stage 3) and the downturn period (Stage 4) are the most common ones. Generally, a normally operating PV station neither breaks down frequently with severe faults nor always yields ideal power generation with high operational efficiency. Hence, the malfunction period (Stages 1 and 5) and the ideal period (Stage 2) are relatively less common.
Figure 6. Five kinds of distribution on (a) a sunny day and (b) a cloudy day.
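To make the five-stage idea concrete, the sketch below assigns a stage to one measurement given the upper and lower references. The exact threshold formulas are defined earlier in the paper and are not repeated in this section, so the band layout here (margins expressed as fractions of the reference width, with α widening or narrowing the upper side and β the lower side) is a hypothetical reading, and the function name `classify_stage` is an assumption. It also assumes the upper reference lies above the lower one, as is the case after the modifications.

```python
WARNING_STAGES = {1, 4, 5}  # stages that raise an alert, per the text above

def classify_stage(p, upper, lower, a1=0.5, a2=0.1, b1=0.5, b2=0.1):
    """Assign a stage (1 to 5) to measured power p; assumes upper > lower."""
    width = upper - lower
    if p > upper + a1 * width:
        return 1  # far above the upper reference: malfunction
    if p < lower - b1 * width:
        return 5  # far below the lower reference: malfunction
    if p >= upper - a2 * width:
        return 2  # close to or above the upper reference: ideal
    if p <= lower + b2 * width:
        return 4  # close to or below the lower reference: downturn
    return 3      # in between: transition

# Toy references: upper = 10, lower = 6, so the reference width is 4.
stages = [classify_stage(p, upper=10.0, lower=6.0) for p in (13.0, 9.8, 8.0, 6.2, 3.0)]
alerts = [s in WARNING_STAGES for s in stages]
```

Scaling the margins by the reference width keeps the bands proportionate across the day, since both references shrink toward zero in the early morning and late afternoon.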
To present the performance evaluation explicitly, we plot the warning boundary lines in Figure 7a,b. In the monitoring system, the engineers can directly distinguish the PV panel statuses and obtain suggestions on how to carry out proper O&M plans. As shown in Figure 7a, the PV power generation declines abruptly and drops greatly at 14:28 on 11 May 2018, which indicates that the PV station is in a malfunction period and maintenance is required. Given the abrupt decline and the long-lasting Stage 5, we consider that there is a direct fault in the PV plant, e.g., nonconnected modules or short/open circuits. Such direct faults are relatively easy to notice in a monitoring system. They usually happen in Stage 5, accompanied by an obvious and long-term decline of PV generation. In this case, further O&M plans consist of checking the detailed PV records of each panel and then locating the faulty one(s). Figure 7b shows a cloudy day in summer; the solar irradiance is strong, so the curves of p, a, and b are nearly sinusoidal. At 10:37, both the A and B baselines fall greatly, whereas the actual p keeps its original trend. There is a high chance that meteorological sensor errors or transmission mistakes have occurred: wrong data are input to our prediction models, so we get wrong results. We suggest that further O&M plans prioritize checking the original database and repairing or replacing faulty sensors.

As for validation, there is no label indicating which performance stage the PV system is in or which fault it suffers, which makes it difficult to conduct detailed verification and give specific classification metrics. We have therefore manually labelled the data and conducted classification experiments to verify the performance of our method against other advanced machine learning classification algorithms. The inputs of the classification models are the monitoring records, which include both the meteorological data listed in (2) and the corresponding generated PV power data. The output of the classification models indicates which performance period the PV system is in, i.e., the malfunction, ideal, transition, or downturn period. We compare our method with several widely used and powerful algorithms in their classification implementations, i.e., support vector machine classification [46][47][48] with linear kernel (SVC-Linear), support vector machine classification with RBF kernel (SVC-RBF), decision tree classification (DTC) [49], random forest classification (RTC) [50], gradient boosting decision trees classification (GBDTC) [51], and extreme gradient boosting trees classification (XGBC) [52]. Apart from XGBC, which is provided by the XGBoost library through a scikit-learn-compatible interface, the above algorithms are available in scikit-learn [45]. The classification performance metrics are shown in Figure 8a-g and Tables 6-12.

Based on whether PV generation matches the real-time meteorological records, we manually classify the original data into four classes, i.e., the malfunction, ideal, transition, and downturn periods. With meteorological records, it is possible to calculate the nominal power generation by the formulas of photoelectric conversion. Specifically, the generated PV power in an ideal period is supposed to be close to the nominal power generation; the generation in a transition period is slightly lower than the nominal one; the generation in a downturn period is relatively low but reasonable (due to too many light barriers or aging panels); and the generation values in a malfunction period are far higher or lower than the nominal values. Such manual divisions are based on expert knowledge and prior experience. We label the malfunction, ideal, transition, and downturn periods with the class indices 0, 1, 2, and 3, respectively. Then, we use 75% of the data for training the classification models and the rest for testing. We apply five-fold cross validation to search for the optimal parameters that give the highest performance. Figure 8 shows the confusion matrix (CM) of all classes in the test dataset. Tables 6-12 detail the performance metrics (precision, recall, f1-score) of each compared algorithm and our method. From Figure 8a-g and Tables 6-12, we conclude that our method achieves the best classification performance, with the highest averaged precision of 0.94, recall of 0.93, and f1-score of 0.93. The other compared methods are far behind, which supports the advantage of our method.
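The per-class precision, recall, and f1-score reported alongside the confusion matrices can be derived from a confusion matrix as in the sketch below. The matrix is a toy example for the four class indices 0 to 3, not data from the paper, and the function names are hypothetical. The averaging shown is an unweighted (macro) mean; the paper does not state which averaging it uses, so this is an assumption.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j.

    Returns {class_index: (precision, recall, f1)}.
    """
    k = len(confusion)
    report = {}
    for c in range(k):
        tp = confusion[c][c]
        predicted_c = sum(confusion[i][c] for i in range(k))  # column sum
        actual_c = sum(confusion[c])                          # row sum
        precision = tp / predicted_c if predicted_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = (precision, recall, f1)
    return report

def macro_average(report):
    """Unweighted mean of precision, recall, and f1 over all classes."""
    k = len(report)
    return tuple(sum(m[i] for m in report.values()) / k for i in range(3))

# Toy confusion matrix for classes 0 (malfunction), 1 (ideal),
# 2 (transition), and 3 (downturn).
toy_cm = [
    [8, 0, 1, 1],
    [0, 9, 1, 0],
    [1, 1, 7, 1],
    [1, 0, 1, 8],
]
report = per_class_metrics(toy_cm)
averages = macro_average(report)
```

Because malfunction and ideal periods are described as less common, a macro average weights those rare classes as heavily as the frequent ones, whereas a support-weighted average would be dominated by the transition and downturn classes.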
In addition, we assess model performance by consulting engineers to judge whether the proposed method gives correct performance evaluations and accurate fault alarms. We apply the proposed method to the monitoring system of the studied PV plant. According to feedback from its on-site engineers, our method achieves accurate performance evaluation and fast fault detection. First, it presents an instantaneous evaluation for each real-time observation. With its assistance, the O&M engineers do not have to keep watching the curves; they only need to check the database when a warning stage (Stage 1, 4, or 5) appears. Second, our method is able to detect both direct and indirect faults in a PV system. It presents an accurate classification and seldom misses potentially anomalous situations, which greatly enhances operational safety and maintenance efficiency. More results and analyses are presented in the Supplementary File.
Although our proposed method runs well in a practical PV plant, Figure 7a,b show that a few unusual observations remain in the early morning and late afternoon, when the illumination intensity is quite weak. As future research, we plan to improve the robustness of our prediction models so that they make more accurate predictions even when the power output is very low.

Discussion
In practical PV stations, direct faults, such as open circuits and transmission errors, are comparatively easy to notice in the monitoring system because they cause an abrupt shift from the previous trend. Among indirect factors, encapsulation or module degradation is common over the life cycle of PV panels and is unavoidable. Therefore, the difficult task of O&M in a PV plant is to implement panel cleaning intelligently, including dust removal and anti-blocking. Compared to direct faults, shading reduces the output power by only a small amount, which is hard to detect. In the past, PV panels were cleaned mainly on a fixed schedule, manually or by robots, such as once a month or once a week. With the proposed method, which evaluates the state of PV panels and provides an instantaneous alarm of degradation, cleaning is triggered only when needed. Furthermore, the proposed framework can easily detect direct PV faults and offer timely O&M suggestions.

Conclusions
This paper presents an O&M framework built on an intelligent detection structure, which enhances O&M efficiency in the PV monitoring system and reduces the burden on monitoring staff. Our method evaluates operating performance and identifies anomalies by comparing measurements with two reference baselines; it is unsupervised and does not depend on labeled fault data. Moreover, considering the non-continuity of PV generation, we build corresponding non-continuous regression models, which are based on the XGBoost algorithm and refined by the results of k-means clustering. Last, by comparing the real-time measured value with both the upper and lower references, our method is sensitive to indirect faults and provides an instantaneous alarm of degradation. Results on a 6.95 MW PV plant indicate that the proposed method is able to evaluate different operating statuses and provide fault identification and O&M suggestions to engineers.
Our work focuses on performance monitoring, fault detection and diagnosis, and O&M optimization in large complex systems. Given proper data from ideal and downturn periods, the proposed method can be readily applied to other similar engineering scenarios, such as the assessment of workshop equipment and fault detection in wind power plants. Moreover, our method can be transferred to remaining useful life (RUL) prediction [53] and to prognostic and health management (PHM) of equipment [54]. In future studies, we plan to concentrate on refining the classification of detected faults and on predictive maintenance based on the proposed method.