A Novel Photovoltaic Array Outlier Cleaning Algorithm Based on Sliding Standard Deviation Mutation

Abstract: The operation data of photovoltaic (PV) arrays contain a large number of outliers, which are caused by array abnormalities and faults, communication issues, sensor failure, and array shutdown during PV power plant operation. Outliers reduce the accuracy of PV system performance analysis and modeling, and make fault diagnosis of PV power plants difficult. The conventional data cleaning method is affected by the distribution of the outlier data. To solve these problems, this paper presents a method for identifying PV array outliers based on sliding standard deviation mutation. Considering the PV array output characteristics under actual environmental conditions, the distribution of array outliers is analyzed. Then, an outlier identification method is established based on sliding standard deviation calculation. This method identifies outliers by analyzing the degree of dispersion of the operational data. The verification consists of a case study and an algorithm comparison. In the case study, multiple sets of actual operating data of different inverters, selected from a large grid-connected power station, are cleaned. The cleaning results illustrate the availability of the algorithm. Then, a comparison against the quantile-algorithm-based outlier identification method demonstrates the effectiveness of the proposed algorithm.


Introduction
Rapid growth in photovoltaic (PV) capacity requires continuous improvement of smart operation and maintenance of PV systems. A smart operation and maintenance platform for PV systems includes core functions such as performance analysis, state evaluation, fault diagnosis, and predictive maintenance [1]. The implementation of such functions depends on high-quality and reliable data. The actual operation of a PV system, however, produces an abundance of outliers due to data propagation signal noise, sensor failure, communication and measurement equipment failure, maximum power tracking abnormalities, array shutdown, and power limitation, among other issues [2]. Outliers can seriously degrade the quality of the original data and hinder the implementation of intelligent operation and maintenance functions [3]. Data cleaning is therefore the basis of intelligent PV operation and is of great significance in practical engineering applications.
Many researchers have been exploring outlier identification and data cleaning in new energy power generation systems. The global probability statistical method and the intelligent clustering

PV Array Electrical Characteristics
The relationship between the PV array power, current, voltage, and external environmental parameters must be established to further investigate the actual operational data distribution of the array. The current I and voltage V under the theoretical condition of PV module can be calculated by Formulas (1)-(2) [19]. The theoretical output power calculation formula of the PV array is shown in Formulas (3)-(5) [20].
Here, $G$ is the irradiance, $\tau_{pv}$ is the transmittance of the outer layer of the solar cell, $\eta_{Tref}$ is the conversion efficiency of the solar cell, $\gamma$ is the thermal coefficient of maximum power for crystalline silicon ($\gamma = -0.0045$), $A$ is the area of the receiving surface of the PV module, and $T_c$ is the operating temperature of the solar cell. The cell operating temperature is not easy to obtain in practice, so $T_{NOCT}$ is adopted for calculating the cell conversion efficiency, as in Formula (4), where $T_{NOCT}$ is the cell temperature at normal operating cell temperature (NOCT) conditions and $G_{NOCT}$ is the irradiance at NOCT conditions. The output power of the module is then given by Formula (5), where $T_a$ is the ambient temperature. According to Formulas (1)-(5), combined with the Simulink model, the theoretical curves of the operating current $I$ versus the irradiance $G$ and of the power $P$ versus the irradiance $G$ are established, as shown in Figure 1.
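To make the role of the NOCT quantities concrete, the following is a minimal sketch of a NOCT-based power estimate. It assumes the commonly used linear temperature model, in which the cell temperature rises in proportion to irradiance and the efficiency changes linearly with cell temperature; the exact form of Formulas (3)-(5) may differ. The function names and the default values for `t_noct` and `tau_pv` are hypothetical and chosen only for illustration, while 800 W/m² and 20 °C are the standard NOCT reference conditions.

```python
def cell_temperature(g, t_ambient, t_noct=45.0, g_noct=800.0, t_ambient_noct=20.0):
    """Estimate cell temperature (deg C) from irradiance g (W/m^2).

    Assumed NOCT model: the rise above ambient scales linearly with g.
    t_noct=45.0 is an illustrative value, not taken from the paper.
    """
    return t_ambient + (t_noct - t_ambient_noct) * g / g_noct


def module_power(g, t_ambient, area, eta_ref, tau_pv=0.9, gamma=-0.0045, t_ref=25.0):
    """Theoretical module power (W) under the assumed linear temperature model.

    gamma is negative, so the power drops as the cell heats above t_ref.
    tau_pv=0.9 is an illustrative value, not taken from the paper.
    """
    t_c = cell_temperature(g, t_ambient)
    return g * tau_pv * eta_ref * area * (1.0 + gamma * (t_c - t_ref))


if __name__ == "__main__":
    # Power grows almost linearly with irradiance at a fixed ambient temperature.
    for g in (200.0, 400.0, 600.0, 800.0, 1000.0):
        print(g, round(module_power(g, t_ambient=20.0, area=1.6, eta_ref=0.15), 2))
```

Because the temperature term changes only slowly with irradiance, the computed power stays close to a straight line in $G$, which is the behavior the theoretical curves rely on.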
As shown in Figure 1, both theoretical curves, the PV array's operating current $I$ versus irradiance $G$ and its output power $P$ versus irradiance $G$, are approximately linear. The relationship between PV array operational data and environmental variables is the foundation for analyzing the source of outliers.

Source and Distribution of the Outlier
The data collected from the actual operation of the PV array usually contain a large number of outliers. The causes of outliers in photovoltaic systems mainly include hardware failure, signal noise, photovoltaic power limitation, and so on [21,22]. Moreover, outliers generated by various causes show different characteristics in the electrical parameter curve. Based on the relationship between irradiance and output power, this paper analyzes the distribution characteristics of abnormal data and classifies the types of outliers. Figure 2a shows the PV array monitoring structure. Figure 2b shows the distribution and source of PV array outliers.

According to Figure 2, the relationship between the distribution of outliers and the source is as follows.

1. Type A outliers are "bottom stacked outliers". They are typically caused by a fault or abnormality that cannot be recovered immediately. These faults or anomalies cause the system to continuously generate outliers within a short period of time. The feature of Type A outliers is that the output of the array remains zero while the irradiance is normal. Such outliers are caused by: (1) PV array or inverter failure; (2) communication equipment or sensor failure; (3) power unit shutdown.
In these cases, the output power measurement of the PV array is zero or close to zero, and the outliers are stacked at the bottom of the curve.

2. Type B outliers are "around scattered outliers". They are irregular scattering points near the power curve and are typically caused by faults or abnormalities that can be recovered in a short period of time, including: (1) communication equipment or sensor signal propagation noise; (2) random volatility of external inputs; (3) inaccuracy of maximum power point tracking (MPPT).
Outliers caused by random factors fluctuate randomly around the normal data; such outliers are randomly distributed outside the boundaries of the output curve.
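As a rough illustration of the Type A signature only (this is not the identification method proposed in this paper), the sketch below flags points whose power is near zero although the irradiance is clearly nonzero. The function name and both cut-off values `g_min` and `p_eps` are hypothetical and assume normalized irradiance and power.

```python
import numpy as np


def bottom_stacked_mask(g, p, g_min=0.1, p_eps=0.01):
    """Return True where power is (near) zero while irradiance exceeds g_min.

    g and p are normalized irradiance and power; g_min and p_eps are
    illustrative cut-offs, not values from the paper.
    """
    g = np.asarray(g, dtype=float)
    p = np.asarray(p, dtype=float)
    return (g > g_min) & (p <= p_eps)


if __name__ == "__main__":
    g = [0.0, 0.5, 0.8, 0.9]
    p = [0.0, 0.0, 0.75, 0.005]
    # The night-time point (g = 0) is not flagged; the two daytime zeros are.
    print(bottom_stacked_mask(g, p))
```

A fixed rule of this kind cannot capture Type B outliers, which scatter around the curve rather than stacking at a known level.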

Sliding Standard Deviation Mutation Algorithm Principle
In this paper, as mentioned above, the distribution characteristics of PV array outliers form the basis for determining outliers. When outliers emerge in the operational data, data characteristics such as the rate of change, mean value, variance, standard deviation, and variance rate change, so an appropriate mutation index can be selected to identify the outliers accurately. The standard deviation objectively reflects the degree of dispersion of a dataset and is the most commonly used measure of statistical dispersion in probability statistics [23]. Therefore, the standard deviation is chosen as the mutation indicator. However, standard deviations computed for separate groups of the same sample are discrete, which has a negative impact on the accuracy of the final evaluation results. Therefore, this paper introduces the sliding standard deviation. Its advantage is that the sliding window maintains the continuity of the standard deviation between different groups and increases the sensitivity of the mutation index [24].
The known set $U$ is divided into $m$ mutually independent subsets $Y$, where $U = \{Y_1, Y_2, \cdots, Y_m\}$ and $m$ is the number of subsets. Let $X = \{X_s, X_1\}$ be the target set, where $X_s$ is the data subset with normal array power generation performance and $X_1$ is the data subset with abnormal array power generation performance; the two satisfy $X_s \cap X_1 = \varnothing$ and $X_s \cup X_1 = U$.
Set the sliding set to $Z_j = \{(x_1, y_1), (x_2, y_2), \cdots, (x_a, y_a)\}$, where $Z_j$ denotes the sliding set at the $j$th position within a subset, $j = 1, 2, \cdots, n-a+1$, $i = 1, 2, \cdots, n$, $a$ is the total number of data points in the sliding set with $1 < a < n$, and $n$ is the total number of data points in the subset. The known subset is $Y_m = \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n) \mid y_i < y_{i-1},\ i \in (2, n)\}$, where $x$ represents irradiance and $y$ represents power. The sliding standard deviation within the group is calculated by Formula (6) [25], where $\sigma_{m,j}$ is the $j$th standard deviation value in the subset $Y_m$ and $\mu$ is the average value of $y_i$ in the sliding set. Then, change-point analysis is performed on the calculated sliding standard deviations, and the standard deviation threshold $H$ is set based on the analysis result, as in Formula (7), where $i$ represents the $i$th point in the subset $Y_m$ within the standard deviation threshold, $i_{min}$ is the first such point, and $i_{max}$ is the last. After the normal points within the threshold are obtained for each subset, the normal and outlier sets follow from Formula (8), where $X_s$ is the normal data subset of the sample set $U$ and $X_1$ is the outlier subset of the sample set $U$.
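The calculation described above can be sketched as follows. This is a minimal illustration under two assumptions that are not confirmed by the text: the standard deviation uses the population form (division by $a$), and the normal range runs from the start of the first in-threshold window to the end of the last in-threshold window. The function names `sliding_std` and `normal_range` are hypothetical.

```python
import numpy as np


def sliding_std(y, a):
    """Standard deviation of every length-a window of y.

    Assumption: population form (ddof=0). Returns n - a + 1 values.
    """
    y = np.asarray(y, dtype=float)
    if not 1 < a < y.size:
        raise ValueError("window size a must satisfy 1 < a < n")
    windows = np.lib.stride_tricks.sliding_window_view(y, a)  # shape (n - a + 1, a)
    return windows.std(axis=1)


def normal_range(sigma, a, h):
    """Return 0-based (i_min, i_max) of the points kept as normal, or None.

    Assumption: i_min is the start of the first window with sigma <= h and
    i_max is the last point covered by the last such window.
    """
    inside = np.flatnonzero(np.asarray(sigma) <= h)
    if inside.size == 0:
        return None
    return int(inside[0]), int(inside[-1] + a - 1)


if __name__ == "__main__":
    # Power values of one subset, already sorted in descending order:
    # three scattered high points, twenty tightly grouped points, three low points.
    y = np.concatenate([[1.0, 0.95, 0.9], np.linspace(0.80, 0.78, 20), [0.2, 0.1, 0.0]])
    sigma = sliding_std(y, a=5)
    print(normal_range(sigma, a=5, h=0.01))
```

Under these assumptions, only windows that lie entirely inside the tightly grouped points stay below the threshold, so the scattered points at both ends of the sorted subset fall outside the returned range.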

Algorithm Steps in Detail
The measured irradiance and power data of a PV array are chosen to illustrate the steps of the sliding standard deviation algorithm in detail. Figure 3 shows the flow chart of the algorithm.
The detailed steps of the algorithm are as follows (the solar irradiance and power of the PV array have been normalized).
(1) Select raw data $U$. The selected data contain one year of measured solar irradiance and power data of the PV array.
(2) Divide $U$ into subsets $Y_m$. The maximum value of the irradiance data in the sample data is 1100 W/m² and the minimum value is 0 W/m², so the range of irradiance is 0-1100 W/m². The data are divided into several subsets according to the irradiance interval. According to the rule of the algorithm, each subset must meet a minimum data size requirement. This restriction can be adjusted according to the total amount of sample data. The minimum data size for the selected data is 60. Therefore, when a subset contains fewer than 60 data points, it is merged with the previous subset until the data size requirement is met. The irradiance interval is set as $T$ = 10 W/m², and the sample data are divided into 110 subsets $Y_m$ that meet the rule. The calculation process of each subset is similar, so the 90th subset, whose irradiance range is 900-910 W/m², is taken as an example.
(3) Sort the subset data in descending order. There are 110 power points in the 90th subset, and the data points are arranged in descending order to satisfy $y_i < y_{i-1}$, $i \in (2, 110)$. The 90th subset is $Y_{90} = \{(x_1, y_1), (x_2, y_2), \cdots, (x_{110}, y_{110})\}$, where $x_i$ is the irradiance and $y_i$ is the power of the $i$th data point $(x_i, y_i)$.
(4) Calculate the sliding standard deviation. Set the data capacity of the sliding set to $a$ = 30. The subset data are brought into the sliding set $Z_j$, where $j = 1, 2, \cdots, 81$, as shown in Figure 4. Calculate the sliding standard deviations of the 81 sliding sets. The 81 sliding standard deviations and other specific data are shown in Table 1.
(5) Set the threshold and identify the mutation points. The box diagram is shown in Figure 5. The statistical distribution of the 110 subsets' sliding standard deviations shows that most of the normal values are concentrated between 0.015 and 0.025. After the threshold is set to 0.02, the sliding standard deviation curve of the subset $Y_{90}$ is shown in Figure 6a. The curve to the left of the 20th value and to the right of the 73rd value turns up significantly with no tendency to stabilize. The steady trend in the middle of the curve indicates that the sliding standard deviations of this part are similar and that the data fluctuation range is small; these are the normal data. By combining the threshold and Formula (7), the 20th and the 103rd data points of the subset $Y_{90}$ are determined as the mutation points. Therefore, the first 19 data points ("Upper" area in Figure 6a) and the last 8 data points ("Lower" area in Figure 6a) are identified as outliers. The cleaning result is shown in Figure 6b.
(6) Clean the whole data set $U$. The final results of the above stepwise process are shown in Figure 7. The normal data distribution trend in Figure 7 is consistent with the theoretical curve of the PV array irradiance and power in Figure 1.
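Steps (2) and (3) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name `split_by_irradiance` is hypothetical, a small first bin is kept as it is because it has no previous subset to merge into, and a subset that is still too small after one merge is not merged again.

```python
import numpy as np


def split_by_irradiance(g, p, width=10.0, min_size=60):
    """Group points into irradiance bins and sort each group by power.

    A bin with fewer than min_size points is merged into the previous group
    (simplification: the first bin is kept even if it is small).
    Returns a list of index arrays, each sorted by power in descending order.
    """
    g = np.asarray(g, dtype=float)
    p = np.asarray(p, dtype=float)
    bin_index = np.floor(g / width).astype(int)
    groups = []
    for b in np.unique(bin_index):
        idx = np.flatnonzero(bin_index == b)
        if groups and idx.size < min_size:
            groups[-1] = np.concatenate([groups[-1], idx])  # merge with previous subset
        else:
            groups.append(idx)
    # Descending power order within each subset, as required by step (3).
    return [grp[np.argsort(-p[grp], kind="stable")] for grp in groups]


if __name__ == "__main__":
    g = [5, 5, 5, 15, 15, 25]
    p = [0.1, 0.3, 0.2, 0.5, 0.4, 0.9]
    # With min_size=2, the single point at 25 W/m^2 joins the 10-20 W/m^2 subset.
    print([grp.tolist() for grp in split_by_irradiance(g, p, min_size=2)])
```

For the worked example in the steps above, a subset of $n$ = 110 points with a sliding set of $a$ = 30 points gives $n - a + 1$ = 81 sliding sets. Each sliding set $Z_j$ covers points $j$ to $j + 29$, so if the sliding sets from $j$ = 20 to $j$ = 73 are the ones within the threshold, they cover points 20 to 102, which leaves the first 19 and the last 8 points as outliers.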

Verification
In order to further verify the effectiveness of the outlier cleaning method proposed in this paper, the operational data of a PV array combiner box are chosen for the verification. The operational data are selected from a large grid-connected photovoltaic power station in China. All irradiance data and power data have been normalized.

Case Study
The total installed capacity of the grid-connected PV power station is 39.3 MW, and there are 74 centralized inverters installed, each of which is connected to six array combiner boxes. Sixteen PV branches are connected to each array combiner box, and each branch has 16 PV modules connected in series. The model number of the module is CEC6-72-300P. The actual operation data of two array combiner boxes in 2017 (numbered 4A and 37A) are selected as the cleaning sample data. The data resolution is 10 min.
Threshold settings directly influence data cleaning results, and different threshold setting schemes alter the data cleaning results. The threshold setting range is presented in Section 3. The data of the 37A array combiner box are taken as an example here to illustrate the cleaning effect of the sliding standard deviation mutation algorithm of different threshold settings. There is a linear relationship between power and irradiance, so the linear correlation coefficient is introduced as the evaluation index. The calculation results are shown in Table 2.
As shown in Table 2, the threshold setting affects the data deletion rate and the linear correlation coefficient of the cleaning result. Within the designed threshold range, the linear correlation coefficient is stable above 99% and increases slightly as the threshold decreases. At the same time, the data deletion rate increases markedly as the threshold decreases. A smaller threshold setting yields a denser normal data distribution (Figure 8), and a larger threshold setting yields a more dispersed normal data distribution.
The threshold settings need to be adjusted to the actual application scenario. For example, when modeling a PV system, data quality takes priority, so a smaller threshold is preferable. In this case, the threshold range for 37A can be set to 0-0.01, which deletes some scattered but correct data in exchange for higher-quality and more densely distributed normal data.
In this case study, we aimed to prevent misidentification to the greatest extent possible by analyzing the threshold settings. Table 3 and Figure 8 together show that a smaller threshold setting results in a slight increase in the linear correlation, but the amount of deleted data increases significantly, to the point that certain normal points are incorrectly identified as outliers. A larger threshold setting results in less data deletion and a smaller linear correlation coefficient, and more outliers go unrecognized. The optimal threshold setting range for the 37A data is 0.02 to 0.025. Under the same principle, the optimal threshold setting range for the other two sets of data is 0.02-0.025.
The data of arrays 4A and 37A are cleaned by the proposed method, with the threshold for both sets of data set to 0.02. The data cleaning results are shown in Figure 9. The outliers in the 4A and 37A data are effectively identified by the algorithm. The data marked in red in Figure 9 are identified as normal data, and they lie close to the theoretical PV power curve. The results of the two groups also show that both bottom-curve stacked and around-curve scattered outliers are effectively identified by the proposed algorithm, which makes it feasible and effective for data cleaning in large grid-connected PV arrays.
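The two evaluation indices used in this case study can be computed as in the sketch below. It assumes that the linear correlation coefficient is the Pearson coefficient between irradiance and power and that the data deletion rate is the share of raw points removed by cleaning; neither definition is spelled out in the text, and the function names are hypothetical.

```python
import numpy as np


def deletion_rate(n_raw, n_kept):
    """Share of raw data points removed by cleaning, in percent."""
    return 100.0 * (n_raw - n_kept) / n_raw


def linear_correlation(g, p):
    """Pearson correlation coefficient between irradiance g and power p (assumed index)."""
    return float(np.corrcoef(g, p)[0, 1])


if __name__ == "__main__":
    g_clean = [0.0, 1.0, 2.0, 3.0]
    p_clean = [0.0, 2.0, 4.0, 6.0]
    # A single bottom-stacked point lowers the correlation of otherwise linear data.
    g_raw = g_clean + [2.5]
    p_raw = p_clean + [0.0]
    print(linear_correlation(g_raw, p_raw), linear_correlation(g_clean, p_clean))
    print(deletion_rate(len(g_raw), len(g_clean)))
```

Reporting both indices together reflects the trade-off discussed above: a tighter threshold can raise the correlation only by deleting more data.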

Comparison with Other Algorithms
To illustrate the performance of the proposed method, it is compared with the quantile method, which is the method most commonly used in data cleaning. In this section, the quartile is used as the quantile. The original operation data of the 4A and 37A arrays in the above example are used as data samples for the quantile method and the sliding standard deviation mutation algorithm. The two methods are used to clean the sample data under different outlier distributions. Then, the cleaning effect, data deletion rate, and linear correlation coefficient are compared.
Three kinds of data conditions (outlier accumulation at the bottom, outliers around the top, and normal outlier distribution) are designed to assess the cleaning effect of the two methods. The cleaning results under different outlier distributions are shown in Figures 10-12.

Energies 2019, 12
The cleaning results show that, in the cases of bottom outlier accumulation and outliers around the top, the data identified as normal by the quartile algorithm are offset toward the side where the outliers accumulate. Some data are misjudged: some normal data are identified as outliers, and some outliers are identified as normal data. The sliding standard deviation algorithm proposed in this paper is not affected by the outlier distribution, which means that it can still separate the outliers from the operation data.
To analyze the performance of the proposed method, the raw operation data of the 4A and 37A arrays and of another array numbered 17B are cleaned by the quartile method and by the sliding standard deviation mutation algorithm. Figure 13 shows the data cleaning results for 37A by the two methods. Overall, both methods can identify bottom-stacked outliers. The normal data identified by the quartile method are shifted downwards. The outliers in 37A are stacked in the lower part of the scatter, which influences the quartile method's cleaning effect and results in a deviation in normal data recognition. The sliding standard deviation algorithm is not affected by the distribution of outliers. Table 3 shows the cleaning results of the two methods. The two methods give different cleaning results for the outliers around the normal data. For both methods, the linear correlation coefficient of the cleaned data can reach 99%. The data deletion rates of the quartile method for the three arrays are 45.43%, 37.24%, and 32.14%, respectively, which are much larger than the 16.38%, 19.87%, and 12.96% data deletion rates of the sliding standard deviation mutation algorithm. Figures 10-13 show that the quartile method identifies some normal data as outliers, so the quartile method has a higher normal data loss rate than the proposed method.
In summary, the data cleaning method based on the sliding standard deviation mutation algorithm can effectively identify bottom-stacked and around-curve scattered data in the power-irradiance curves of different PV arrays. Although the outliers distort the distribution of the data, the proposed method can still effectively identify the outliers.
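The sensitivity of a quartile rule to the outlier distribution can be reproduced with a small sketch. The text does not state the exact quartile rule used in the comparison, so the sketch assumes the common fence rule, which flags values outside $[Q_1 - k \cdot IQR,\ Q_3 + k \cdot IQR]$ with $k$ = 1.5; the function name is hypothetical and the data are synthetic.

```python
import numpy as np


def quartile_outlier_mask(p, k=1.5):
    """Return True where p lies outside [Q1 - k*IQR, Q3 + k*IQR].

    Assumption: standard quartile fences with k = 1.5; the comparison
    method in the paper may use a different rule.
    """
    p = np.asarray(p, dtype=float)
    q1, q3 = np.percentile(p, [25, 75])
    iqr = q3 - q1
    return (p < q1 - k * iqr) | (p > q3 + k * iqr)


if __name__ == "__main__":
    # One irradiance bin: normal power near 0.8 plus zero-power outliers.
    few = np.concatenate([np.zeros(5), np.linspace(0.78, 0.82, 95)])
    many = np.concatenate([np.zeros(40), np.linspace(0.78, 0.82, 60)])
    # With few zeros the fences stay tight and the zeros are flagged.
    print(int(quartile_outlier_mask(few).sum()))
    # With many stacked zeros Q1 drops to zero, the fences widen, and nothing is flagged.
    print(int(quartile_outlier_mask(many).sum()))
```

When the stacked outliers make up more than a quarter of a bin, the lower quartile itself becomes an outlier value, which is one way the quartile bounds can shift toward the side where outliers accumulate.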

Conclusions
To effectively clean the outliers in PV array data, this paper presents a cleaning method based on the sliding standard deviation. The main work of the paper is as follows. The distribution characteristics and sources of outliers in the actual operating data of PV arrays are analyzed and summarized. A method based on sliding standard deviation mutation is proposed for identifying PV array outliers. In the case study, different sets of actual operational data are selected as the cleaning samples. The results demonstrate the availability of the algorithm, and the performance comparison with the quartile algorithm shows the effectiveness of the proposed algorithm. Based on the above work, the highlights of the paper can be summarized as follows:
(1) The typical sources and distribution features of PV array outliers are revealed. This study finds that the outliers of PV arrays can be sorted into two categories (bottom-stacked data and around-curve scattered data), and summarizes the source and distribution of these two types of outliers.
(2) The linear relationship between PV array operational data and environmental variables is used as the foundation of the cleaning method. The proposed method is consistent with the output characteristics of the PV array, which can improve the recognition rate. The cleaning results are consistent with the theoretical relationship between irradiance and output power.
(3) The outlier data distribution affects classical data cleaning algorithms and leads to misidentification. This paper finds that the classical quantile method is affected by different distributions of the outliers, and that this influence leads to misidentification of outliers and reduces accuracy. The proposed method avoids the negative effects of the outlier distribution through the algorithm design of sliding groups.
The proposed method can be used to preprocess the original operational data of the PV array. By cleaning out irrelevant outliers, high-quality data samples are provided for subsequent work of the operation and maintenance of the power plant.