1. Introduction
A photovoltaic (PV) power plant is a renewable, sustainable, and eco-friendly means of converting solar energy, the most abundant and freely available energy source, into electrical energy [1]. The contribution of solar energy to the total global energy supply has increased rapidly in recent decades, with installed PV capacity exceeding 500 GW by the end of 2018 [2,3]. However, PV systems are exposed to harsh working conditions owing to uncertain outdoor environments and their complex structure. In [4], it was reported that the annual losses in PV generation reached 18.9% under zero or shading faults. Improving the reliability of renewable energy generation through fault detection and diagnosis (FDD) and the correction of faulty data is essential for maintaining the efficiency of PV generation [5]. In addition, reliable data are required by various power applications, e.g., energy scheduling [6] and energy forecasting [7,8], to guarantee safe and stable grid systems.
For utility planners and operators, it is essential to examine the power output variability [9] when aggregating a fleet of PV systems, where a fleet is defined as a number of individual PV systems spread out over a geographical area [10]. Several studies have presented station-pair correlation analyses by introducing virtual networks. A correlation in short-term irradiance variability was observed as a function of distance and time scale [11]. Similarly, the maximum output variability of a fleet of PV plants was estimated using the clearness index [12]. A variability model was built in [13] to integrate a large amount of generated solar power into power systems. To integrate the fleet as a distributed power source, it should be managed by an intelligent monitoring system that can correct abnormal data via real-time fault diagnosis and power generation forecasts.
PV faults occur for various reasons and at different locations, such as a module, a string, or any other part of the PV system. Visual and thermal methods for PV fault detection can detect superficial problems, such as browning, soiling or snow, discoloration, delamination, and hot spots, using auxiliary measurements [14]. However, these methods require expensive and complicated equipment [15]. In recent years, many studies have instead applied data-driven methods to electrical variables. The electrical signal approaches mainly consist of maximum power point tracking (MPPT) with I–V characteristic analysis and power loss analysis. They are usually used to distinguish open-circuit, short-circuit, degradation or aging, and shading faults, which typically occur on the DC side of a PV array [16,17,18].
Among data-driven methods, automatic fault detection approaches can be categorized into conventional model-based methods and methods that use machine learning [19]. In the former case, a model is built from the physical attributes in the PV module specification, and its simulated output is compared with the measured output [20]. Conventional statistical detection methods have been presented in many previous studies [16,21,22,23,24,25]. The exponentially weighted moving average has been used to identify DC-side faults by comparing the one-diode model with estimated MPPs [21,22]. Lower and upper limits were set at 3-sigma for the ratio of the measured to the modeled AC power, and values outside these limits were treated as faults [22]. In [23], three outlier detection rules applied to a PV string current were described in statistical detail: the 3-sigma rule, the Hampel identifier, and the boxplot rule. A symbolic aggregate approximation (SAX) scheme was used to convert the voltage profile prior to clustering and anomaly detection [24].
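The three outlier rules named above (3-sigma, Hampel identifier, boxplot) can be illustrated with a minimal Python sketch. The function names, the default thresholds (k = 3, whisker = 1.5), the MAD scaling constant 1.4826 (which assumes roughly Gaussian data), and the sample current values are illustrative assumptions and are not taken from [23].

```python
import statistics

def three_sigma_outliers(x, k=3.0):
    """Flag samples more than k standard deviations away from the mean."""
    mu = statistics.fmean(x)
    sd = statistics.pstdev(x)
    return [abs(v - mu) > k * sd for v in x]

def hampel_outliers(x, k=3.0):
    """Flag samples more than k scaled MADs away from the median."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    scale = 1.4826 * mad  # assumed Gaussian consistency constant
    return [abs(v - med) > k * scale for v in x]

def boxplot_outliers(x, whisker=1.5):
    """Flag samples outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v < low or v > high for v in x]

# Hypothetical string current samples (A); the last reading is faulty.
string_current = [8.1, 8.0, 8.2, 7.9, 8.1, 8.0, 8.2, 8.1, 7.9, 0.5]

for rule in (three_sigma_outliers, hampel_outliers, boxplot_outliers):
    flagged = [i for i, f in enumerate(rule(string_current)) if f]
    print(rule.__name__, flagged)
```

On a window this short, the single low reading inflates the standard deviation enough that the 3-sigma rule does not flag it, whereas the median-based Hampel identifier and the boxplot rule do. This masking effect is one common reason for preferring robust statistics on small samples.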
With the advent of artificial intelligence (AI), which is applicable to many domains and particularly well suited to the nonlinear behavior of PV systems, numerous studies have developed AI-based monitoring systems [26]. The artificial neural network (ANN) is widely used either to predict PV generation behavior or as a fault detection module based on several electrical parameters [15,27,28,29,30,31]. In contrast to a conventional back-propagation network, a probabilistic neural network (PNN) uses a probability density function as the activation function; thus, it is less sensitive to noisy and erroneous samples [32,33,34]. The support vector machine (SVM) has been used for fault detection in several studies because it separates classes by finding an optimal hyperplane that maximizes the margin in both binary and multiclass problems [35,36,37]. The decision tree (DT) builds a sequence of intuitive if/else decision rules and can be applied conveniently to large datasets [35,38]. The random forest (RF) has been applied to improve multiclass classification accuracy and generalization performance [19]. Fuzzy classifications based on a fuzzy inference system (FIS) were developed by constructing logic rules [29,39]. A kernel extreme learning machine was investigated owing to its fast learning speed and good generalization [40]. Particle swarm optimization-back-propagation (PSO-BP) has been shown to improve the convergence and prediction accuracy of fault diagnosis systems [41].
Most data-driven approaches involve a supervised fault detection system in which labels are assigned in advance, either as binary PV states (normal or abnormal) or as multiple classes corresponding to fault types. The detection or classification model learns the complex and hidden relation between input attributes and predefined labels in the training phase, and is then tested to determine whether it can properly distinguish PV states for new inputs. However, these processes require human effort to assign labels manually, and it is not easy to visualize the trained model. Graph-based semi-supervised learning (GBSSL) was proposed to detect line-to-line and open-circuit faults using only a few labeled data [42]. In [20], five types of faults were classified based on a single diode model with five input vectors associated with I–V characteristics, solar irradiance, and temperature. Gaussian fuzzy C-means clustering was performed using the distribution of each cluster, and faults were diagnosed by a PNN based on the previously obtained cluster centers [33]. A fuzzy membership algorithm based on degrees of fault data and cluster centers has been proposed [43]. Density peak-based clustering has also been proposed [44]. The 3-sigma rule was applied to determine each cluster center using the normalized voltage and current at the MPPs. Similarly, the PV local outlier factor (PVLOF) was computed from the current of the PV array to identify the degree of faults [45]. A single diode model-based prediction was implemented to generate residuals, which were fed to a one-class SVM that quantifies the dissimilarity between normal and faulty features [46].
In this study, we propose a two-stage framework of self-fault detection and self-imputation for a fleet of PV systems, which uses neighboring PV power generation units based on correlation analysis. Because insolation data are not available with sufficient geographical resolution, especially for small-scale PV systems [12,37], neighboring PV generation data in the same fleet can be used jointly with distant weather data. Since captured daily PV generation data generally include unidentified erroneous samples, faulty data candidates were first labeled in an unsupervised manner using several extracted features. K-means clustering was employed to find faulty data points in the daily PV power outputs obtained from all sites in the fleet. When a profile was judged to be an abnormal pattern, it was restored in the subsequent imputation step. The imputation schemes were implemented with autoregressive (AR) and multiple regression models, optionally using neighboring PV data from the normal candidates obtained in the preceding clustering step. For evaluation, several types of fault patterns observed in actual PV power profiles were simultaneously injected into single or multiple sites, and the proposed schemes were tested without access to the injection information.
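The unsupervised labeling stage described above can be sketched as follows. This is only an illustration under stated assumptions: the two features (daily energy ratio and correlation against a fleet median profile), the rule that the smaller cluster is the abnormal one, and all function names and sample profiles are hypothetical and are not the features or settings used in this paper.

```python
import math
import random
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if undefined)."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def daily_features(profile, reference):
    """Hypothetical features: energy ratio and shape correlation vs. the fleet."""
    ref_energy = sum(reference)
    ratio = sum(profile) / ref_energy if ref_energy > 0 else 0.0
    return (ratio, pearson(profile, reference))

def kmeans(points, k=2, iters=100, seed=0):
    """Plain Lloyd's algorithm with randomly sampled initial centers."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to be empty.
            new_centers.append(
                tuple(statistics.fmean(col) for col in zip(*members))
                if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def detect_abnormal_sites(profiles):
    """Return indices of sites whose daily profile falls in the minority cluster."""
    reference = [statistics.median(col) for col in zip(*profiles)]
    feats = [daily_features(p, reference) for p in profiles]
    labels, _ = kmeans(feats, k=2)
    if len(set(labels)) < 2:
        return []
    # Assumption: faults are rare, so the smaller cluster is abnormal.
    minority = min(set(labels), key=labels.count)
    return [i for i, lab in enumerate(labels) if lab == minority]

# Hypothetical fleet: five healthy sites and one that drops to zero at midday.
base = [0, 1, 3, 5, 6, 5, 3, 1, 0]
profiles = [[s * v for v in base] for s in (1.0, 0.98, 1.02, 0.97, 1.01)]
profiles.append([0, 1, 3, 0, 0, 0, 0, 0, 0])
print(detect_abnormal_sites(profiles))
```

Using the fleet median as the reference means that no irradiance sensor is needed at each site, which matches the motivation given above for relying on neighboring PV data.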
The remainder of this paper is organized as follows: Section 2 describes the relationship between distance and correlation in PV fleet power output using actual data measured on campus. Section 3 proposes efficient fault detection and imputation methods for use with a PV fleet. The simulation setup, including the injected fault patterns, is provided in Section 4. Section 5 details the detection and imputation results, and Section 6 concludes the paper.
6. Conclusions
This paper presents a framework for PV fault detection and imputation on PV fleets that requires no manual annotation of the state of the PV systems. We supplemented the meteorological data measured at LWS, which had a relatively low cross-correlation with PV generation, with neighboring PV fleet data. Several features were derived as inputs to K-means clustering to label normal and abnormal patterns. Data from the PV fleet on the campus and solar irradiance measured at one of the PV sites in the fleet were used to extract the features for fault pattern detection. We arbitrarily injected fault patterns based on actual observations, and the detection accuracy was evaluated using a confusion matrix. The detection error rate was compared for three cases: SolarClique (a conventional prediction-based detection method), a clustering-based method without PV fleet data, and the proposed clustering-based method with PV fleet data. The error rates for these three cases were 0.0753, 0.0223, and 0.0126, respectively, which indicates that the proposed clustering-based detection using neighboring PV fleet data can effectively detect the faults.
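For reference, a detection error rate can be derived from a binary confusion matrix as in the short Python sketch below. It assumes the usual definition, (FP + FN) divided by the total number of samples, with 1 denoting an injected fault and 0 a normal profile; the function names and the sample labels are illustrative and do not reproduce the experiments reported here.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = fault, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def error_rate(y_true, y_pred):
    """Fraction of misclassified samples: (FP + FN) / total."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (fp + fn) / (tp + fp + tn + fn)

# Hypothetical ground truth (injected faults) and detector output.
injected = [1, 0, 0, 1, 0, 0, 0, 1]
detected = [1, 0, 1, 0, 0, 0, 0, 1]
print(confusion_counts(injected, detected), error_rate(injected, detected))
```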
Data imputation was then conducted for the patterns identified as abnormal. Five cases of regression-based imputation and kNN were evaluated by the normalized root-mean-square error (NRMSE). In general, Cases 4 and 5, which utilized neighboring self-labeled PV data, showed better imputation performance than imputation without nearby sites or without labeled PV data, reducing the NRMSE by over 19% and 18%, respectively. In addition, according to the earlier grouping based on cross-correlation analysis, the G3 sites, which were close to SWS, generally showed better performance when only solar irradiance data were used. However, the imputation result for G3 sites and neighboring PV profiles provided more relevant information than weather data obtained at SWS that was relatively far away. In summary, neighboring PV data are effective in improving fault detection and imputation accuracy in a dense PV fleet.
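The idea of regression-based imputation from a neighboring site, scored by NRMSE, can be sketched in Python as follows. Several simplifications are assumed: a single-regressor least-squares line stands in for the multiple regression models, the NRMSE is normalized by the range of the true values (other normalizations such as the mean or the rated capacity are also common), and all names and numbers are hypothetical rather than taken from the five cases evaluated in this paper.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def impute_from_neighbor(neighbor_hist, target_hist, neighbor_today):
    """Predict the target site's output from a normal neighbor's profile."""
    slope, intercept = fit_line(neighbor_hist, target_hist)
    return [slope * v + intercept for v in neighbor_today]

def nrmse(truth, estimate):
    """RMSE divided by the range of the true values (assumed normalization)."""
    span = max(truth) - min(truth)
    if span == 0:
        raise ValueError("NRMSE is undefined for a constant reference series")
    mse = sum((t - e) ** 2 for t, e in zip(truth, estimate)) / len(truth)
    return math.sqrt(mse) / span

# Hypothetical history in which the target produces twice the neighbor's power.
neighbor_hist = [0, 1, 2, 3, 4]
target_hist = [0, 2, 4, 6, 8]
restored = impute_from_neighbor(neighbor_hist, target_hist, [1, 3])
print(restored, nrmse([2, 6], restored))
```

In this sketch the regression is trained only on periods in which both sites are assumed to be normal, which is the role played by the labels from the clustering step.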