Combined Correlation and Cluster Analysis for Long-Term Power Quality Data from Virtual Power Plant

Abstract: Analyzing the relations between units that operate in the same area often gives interesting results. In this investigation, the area of concern was a virtual power plant (VPP) that operates in Poland. The main distributed resources included in the VPP are a 1.25 MW hydropower plant and an associated 0.5 MW energy storage system. The VPP was a source of synchronous, long-term, multipoint power quality (PQ) data. For five related measurement points, the PQ relations were assessed using correlation analysis, the global index approach, and cluster analysis. Global indicators were applied in place of PQ parameters to reduce the amount of analyzed data and to check the correlation between phase values. For such a large dataset, outliers are practically certain to occur, and they may affect the correlation results. Thus, to find and exclude them, cluster analysis (k-means algorithm, Chebyshev distance) was applied. Finally, the correlation between the PQ global indicators of different measurement points was determined. It gave general information about the PQ relations between the VPP units. Both Pearson's and Spearman's rank correlation coefficients were considered.


Introduction
The integration of renewable energy sources (RES) and energy storage systems (ESSs) into electrical power networks is increasing significantly. An important issue is controlling them efficiently. The current approach is to integrate them into microgrids and virtual power plants (VPPs) [1]. Generally, VPPs are integrated units equipped with an effective power flow control system. Virtual power plants consist of generators, loads, and energy storage systems [2]. Research issues connected with VPPs include, e.g., energy management in VPPs [3][4][5]; active and reactive power scheduling optimization [6][7][8]; participation in the energy market [9][10][11]; voltage control by RES integrated in VPPs [12][13][14]; localization and management of ESSs in VPPs [15][16][17]; and power flow control and analysis [18][19][20]. Further, studies so far concern real cases from Europe (Germany [21], Denmark [22], Greece [23], Ireland [24], United Kingdom [25]) or other world regions (Australia [26], China [27], South Korea [28], India [29]). The general methods presented in this article are correlation analysis and cluster analysis, applied to power quality (PQ) issues in VPPs. Thus, the literature review concerns these topics.
The authors of [30] presented a method to identify the sources of PQ disturbances based on the correlation of monitoring data between different nodes of the power system. The applied correlation methods were based on, e.g., the Pearson coefficient, the Spearman rank coefficient, or the partial correlation coefficient. The correlation is calculated between both the voltage and current indices to extract specific (problematic) nodes. The authors [...] related points. The measurement covered 182 days, from 1 May 2020 to 28 October 2020, which totals 26 weeks and represents the operation of the real VPP. For such a large PQ dataset, the concept of global values was introduced. In the literature, it is known under different names, such as unified power quality index [40,41]; global power quality index [42,43]; synthetic power quality index [44,45]; or total power quality index [46,47]. The selected global indicators [42,43] were then applied in place of PQ parameters to verify their correlation. However, for such a large dataset, outliers are practically certain to occur. To find and exclude them, cluster analysis (CA) was applied. The selected algorithm was k-means with Chebyshev distance. Finally, the correlation between the PQ global indicators of the five mentioned points was determined. It gave general information about the PQ relations between the VPP units. Both Pearson's and Spearman's rank correlation coefficients were considered and compared.
The article's contributions are as follows:
• The investigation is based on real, synchronous, multipoint measurements from the virtual power plant. The data cover a long-term period of 26 weeks.
• The article proposes using global indicators in place of classical parameters to reduce the size of the analyzed dataset. Where applicable, the indicators represent three phase values as one value while maintaining the features of each phase. Further, the global indicators are standardized to the limits of the selected PQ standard to simplify and unify the comparison and assessment.
• In addition to the classic 10-min parameters, the global indicators include the extreme 200-millisecond values of voltage and of total harmonic distortion in voltage.
• Cluster analysis with the k-means algorithm and Chebyshev distance is proposed to detect and exclude the outliers from the dataset, so that the correlation assessment reflects general conditions.
• The correlation between the different measurement points in terms of PQ is assessed using both Pearson's and Spearman's rank correlation coefficients.
To summarize, the main aim of the investigation was to conduct correlation analysis for multipoint measurement from the VPP cleaned by the CA method using PQ indicators in place of classic parameters.
The article is organized into five sections. Section 2 introduces the source of data and proposed methodology. Section 3 presents the result of combined correlation and cluster analysis for PQ data from the VPP. Section 4 contains the discussion of results. Section 5 draws conclusions.

Methodology
The methodology of this research is based on three main elements. The first is the use of global values in place of classic PQ parameters. Then, correlation analysis using both Pearson's and Spearman's rank coefficients is proposed to define the relations in terms of PQ. Finally, the cluster analysis approach is proposed to find and exclude the outliers that have a large impact on the correlation analysis, so that the correlation results are reliable. Figure 1 summarizes this methodology.

Global Indicators
A current extension of PQ analysis is the application of global values that represent several parameters while maintaining their features. This article uses the indicators of a global index, the aggregated data index (ADI), used in, e.g., [42,43]. It includes both classic 10-min PQ parameters and extreme values from 10-min data. The indicators used in this investigation were the voltage indicator (I_U), voltage envelope indicator (I_∆U), flicker indicator (I_Pst), unbalance indicator (I_ku2), harmonic indicator (I_THDu), and maximal harmonic indicator (I_THDumax). The indicators were obtained as follows:
• The voltage indicator is the mean value of the differences between the nominal value and the three phase values of the voltage, standardized to the limit value from the selected standard;
• The voltage envelope indicator is the difference between the 200-millisecond maximal and minimal values of voltage recorded in the same 10-min aggregation interval, standardized to double the limit value of the selected standard;
• The unbalance indicator is standardized to the limit value from the selected standard;
• The flicker indicator and harmonic indicator are each the mean value of the three phase values, standardized to the limit value from the selected standard.
The applied indicators generally give one value that represents three phase values. The extension to the classic assessment is the application of 200-millisecond extreme values, which are taken from each 10-min interval for voltage and total harmonic distortion in voltage. Additionally, all of them are related to the limit values of the standard. The standard selected for this investigation was the European standard EN 50160 [48], and the limit values were as follows:
• Voltage: 10% of declared voltage;
• Short-term flicker severity: 1.0;
• Unbalance: 2%;
• Total harmonic distortion in voltage: 8%.
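The indicator definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, taking the absolute deviation in the voltage indicator is an assumption (the text only speaks of "differences"), and the maximal harmonic indicator is left out because its formula is not spelled out in the list above.

```python
from statistics import mean

# Limit values quoted in the text from EN 50160.
VOLTAGE_LIMIT = 0.10     # 10% of declared voltage
PST_LIMIT = 1.0          # short-term flicker severity
UNBALANCE_LIMIT = 2.0    # percent
THD_LIMIT = 8.0          # percent


def voltage_indicator(u_phases, u_declared):
    """Mean deviation of the phase voltages from the declared voltage,
    relative to the voltage limit (absolute deviation is an assumption)."""
    deviations = [abs(u - u_declared) for u in u_phases]
    return mean(deviations) / (VOLTAGE_LIMIT * u_declared)


def envelope_indicator(u_max_200ms, u_min_200ms, u_declared):
    """Spread between the 200 ms maximum and minimum of one 10-min interval,
    relative to double the voltage limit."""
    return (u_max_200ms - u_min_200ms) / (2 * VOLTAGE_LIMIT * u_declared)


def flicker_indicator(pst_phases):
    """Mean short-term flicker severity of the phases, relative to its limit."""
    return mean(pst_phases) / PST_LIMIT


def unbalance_indicator(ku2_percent):
    """Unbalance factor in percent, relative to its limit."""
    return ku2_percent / UNBALANCE_LIMIT


def harmonic_indicator(thd_phases_percent):
    """Mean THD in voltage of the phases (percent), relative to its limit."""
    return mean(thd_phases_percent) / THD_LIMIT
```

With this scaling, an indicator value of 1.0 means that the underlying quantity sits exactly at the limit of the standard, which is what makes indicators of different physical quantities directly comparable.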

Correlation Analysis
Analysis of the correlation between variables makes it possible to describe the relationship between them [49]. In general, correlation seems like an easy process [50]. However, during correlation assessment, different circumstances need to be considered, e.g., [51]:
• Linear or nonlinear dependence of data: if nonlinear data are treated as linear, this may affect the final assessment;
• Correlation analysis is very sensitive to outliers.
Thus, for different types of data, different correlation coefficients are applied [52]. The commonly used coefficients are Pearson's correlation coefficient and Spearman's rank correlation coefficient. Equation (1) presents Pearson's coefficient and Equation (2) presents Spearman's rank coefficient [53]:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}} \qquad (1)$$
$$r_s = 1-\frac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)} \qquad (2)$$
where:
• r: Pearson's correlation coefficient;
• r_s: Spearman's rank correlation coefficient;
• x_i, y_i: i-th values of observations from populations x and y;
• x̄, ȳ: means from populations x and y;
• d_i = r_1i − r_2i: the difference between the ranks of the corresponding feature values x_i and y_i;
• r_1i: rank of the i-th object in the first ordering;
• r_2i: rank of the i-th object in the second ordering;
• n: number of objects under study.
Both coefficients reach values in the range of [−1,1] [54]. The interpretation of the correlation level based on the determined correlation coefficients is presented in Table 1.

Table 1. Coefficient and correlation level description.
Coefficient | Correlation level description
0 | No correlation
[...] | High correlation
0.9 [...] | Strong correlation

The crucial element of correlation analysis is to select the appropriate coefficient according to its properties [56]. The properties of the Pearson coefficient are as follows:
• The analyzed values must have a distribution comparable to a normal distribution;
• There must be a linear relationship between the variables.
The properties of the Spearman rank coefficient are as follows:
• It is more robust to outliers than Pearson's correlation coefficient;
• It can be used to determine any monotonic relationship, including nonlinear relationships.
Additionally, to verify the correlation, the following hypotheses were defined [57]:
• H0: the correlation coefficient is equal to 0 (there is no significant relationship between the parameters);
• H1: the correlation coefficient is different from 0.
The test statistic is
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
where r is the correlation coefficient and n is the number of observations. The statistic follows a Student's t-distribution with k = n − 2 degrees of freedom. The test is decided by comparing the p-value (obtained from the Student's t-distribution) with the assumed significance level α. The most common significance level is α = 0.05. To conclude [49]:
• If the p-value is less than the significance level (α = 0.05), the decision is to reject H0. This means that there is sufficient evidence to conclude that there is a significant relationship between parameters because the correlation coefficient is significantly different from 0.
• If the p-value is not less than the significance level (α = 0.05), the decision is not to reject H0. This means that there is insufficient evidence to conclude that there is a significant linear relationship between parameters because the correlation coefficient is not significantly different from 0.
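The two coefficients and the significance test can be illustrated with a short, standard-library-only Python sketch. The function names are hypothetical. Two simplifications are assumed and should be read as such: Spearman's coefficient is computed as Pearson's coefficient of the ranks (identical to Equation (2) when there are no ties), and the p-value uses the normal approximation to Student's t-distribution, which is only reasonable for large n such as the tens of thousands of 10-min values analyzed here.

```python
import math
from statistics import NormalDist, mean


def pearson(x, y):
    """Pearson's correlation coefficient, as in Equation (1)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank coefficient computed as Pearson's coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


def t_statistic(r, n):
    """Test statistic for H0: correlation coefficient equal to 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def approx_p_value(r, n):
    """Two-sided p-value using the normal approximation (assumes large n)."""
    if abs(r) >= 1.0:
        return 0.0
    return 2 * (1 - NormalDist().cdf(abs(t_statistic(r, n))))


# A monotonic but nonlinear relation: Spearman sees it as perfect, Pearson does not.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y))
```

The final lines show the practical difference discussed above: for the cubic relation, the ranks of x and y coincide, so the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1.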

Cluster Analysis to Exclude Outliers
Correlation analysis, especially using the Pearson coefficient, is very sensitive to outliers. Thus, to obtain general information about relations, the outliers should be excluded. However, excluding data from long-term records that are represented by many parameters may be hard. The solution proposed in this article is cluster analysis (CA), a representative of data mining techniques [58]. The main aim of clustering is to divide the data according to their features [59]. Non-hierarchical CA assigns all observations to a number of clusters known in advance so as to maximize or minimize some evaluation criterion [60]. Non-hierarchical methods may be based on different algorithms, such as the k-means algorithm, the k-median algorithm, or the expectation maximization (EM) algorithm. A further issue is selecting an appropriate measure of distance. Known measures are, e.g., Euclidean, Manhattan, or Chebyshev.
This paper proposes the non-hierarchical approach with the k-means algorithm and Chebyshev distance. The Chebyshev distance was selected because it is very sensitive to extreme values of the parameters [61], which makes it possible to find outliers in the data [62]. The k-means algorithm aims to find the extremum of an objective function using the measure of the distance between objects. The applied k-means objective function with Chebyshev distance is presented in Equation (3) [63,64], where:
• OB: matrix of the object belonging to a cluster;
• CM: matrix in which a row vector represents the centroids of clusters;
• e_ij: element indicating the assignment of the i-th object to the j-th class (cluster);
• a_i: vector of observations belonging to cluster x;
• b_i: vector of observations belonging to cluster y.
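As an illustration of the approach, the sketch below runs a plain k-means loop in which the assignment step uses the Chebyshev distance, i.e., the largest coordinate-wise absolute difference between two vectors. It is a minimal sketch under stated assumptions, not the software used in the study: the function names are hypothetical, the centroids are initialized deterministically by a farthest-point rule, and the centroids are updated as coordinate-wise means, which is a common simplification when k-means is combined with a non-Euclidean distance.

```python
def chebyshev(a, b):
    """Chebyshev distance: largest coordinate-wise absolute difference."""
    return max(abs(p - q) for p, q in zip(a, b))


def kmeans_chebyshev(points, k, max_iter=100):
    """k-means loop with Chebyshev distance in the assignment step.

    Initialization (an assumption of this sketch): the first point, then
    repeatedly the point farthest from the centroids chosen so far.
    """
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(chebyshev(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid in the Chebyshev sense.
        new_labels = [
            min(range(k), key=lambda j: chebyshev(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: coordinate-wise mean of the members of each cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Toy three-phase values: the last two rows have one extreme phase value.
data = [
    (0.20, 0.22, 0.21), (0.25, 0.24, 0.26), (0.22, 0.21, 0.23),
    (0.24, 0.25, 0.22), (0.21, 1.60, 0.23), (0.23, 1.50, 0.22),
]
labels, centroids = kmeans_chebyshev(data, k=2)
print(labels)
```

Because the Chebyshev distance is driven by the single largest coordinate difference, a row with an extreme value in only one phase is already far from the bulk of the data, which is the property that motivates its use for outlier detection here.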

Results
This section presents the results of four investigation directions. The first concerns the application of global indicators in place of PQ parameters and the assessment of their correlation. Then, CA is applied to divide the data in order to exclude outliers, which affect correlation analysis. Next, the correlation between PQ global indicators is determined between different measurement points in the VPP. Finally, the results of the two applied coefficients, Pearson and Spearman, are compared.

Virtual Power Plant as a Source of Area-Related PQ Data
The source of data used in this article is a VPP that operates in Poland, in the region of Lower Silesia. The VPP operates on a fragment of the distribution network at both medium voltage (MV) and low voltage (LV) levels [64]. It is connected to a 110 kV Polish grid by two substations of 110/20 kV. In this investigation, one MV network was selected that has earth fault current compensation [64]. The main distributed energy resources that are integrated into the investigated VPP are a 1250 kW HPP and a 500 kW battery ESS, which are connected at the MV level to a distribution system.
The simplified scheme of the studied fragment of the VPP is presented in Figure 2. The mentioned elements of the VPP are as follows:
• The 1.25 MW hydropower plant, denoted as MV_H;
• The 0.5 MW battery ESS, denoted as MV_E;
• The 20 kV line that connects the HPP and ESS substation to the high-voltage/medium-voltage substation, denoted as MV_L;
• The representative low-voltage load associated with MV_L, denoted as LV_L;
• The representative low-voltage load associated with the substation of the HPP and ESS, denoted as LV_H&E.
The indicated units of the VPP are monitored by power quality recorders. Power quality recorders are indicated as "R", and their connection is also included in Figure 2. MV_H and MV_E are connected to one node and their PQ recorders use the same voltage transformer. Thus, in this research, they are treated as one point for further investigation, denoted as MV_H&E.
The PQ measurement lasted 182 days (26 weeks). The PQ data were aggregated over 10-min intervals, so the selected period should be represented by 26,208 10-min values. However, the coverage of multipoint synchronous data was 97.7% (25,069 10-min values). Additionally, the 10-min values that contain voltage events were excluded from this dataset in accordance with the flagging concept of IEC 61000-4-30 [65]. The only extension was that a 10-min value was excluded if a voltage event occurred in at least one of the measurement points. Finally, the investigated dataset contained 24,612 10-min values.
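The data selection described above can be sketched as follows. The data layout (a dictionary of timestamps with a boolean flag per measurement point) and the function name are assumptions made for illustration; the toy timestamps are plain integers.

```python
INTERVALS_PER_DAY = 24 * 60 // 10               # 144 ten-minute intervals per day
EXPECTED_INTERVALS = 182 * INTERVALS_PER_DAY    # 26,208 intervals in 182 days


def synchronous_unflagged(records_by_point):
    """Return the sorted timestamps that are present at every measurement
    point and flagged at none of them.

    records_by_point maps a point name to {timestamp: flagged}, where
    flagged is True if a voltage event occurred in that 10-min interval.
    """
    records = list(records_by_point.values())
    common = set(records[0])
    for rec in records[1:]:
        common &= set(rec)  # keep only timestamps recorded at all points
    return sorted(t for t in common if not any(rec[t] for rec in records))


# Toy example: interval 2 is flagged at MV_L, interval 3 is missing at LV_L.
toy = {
    "MV_L": {0: False, 1: False, 2: True, 3: False},
    "LV_L": {0: False, 1: False, 2: False},
}
kept = synchronous_unflagged(toy)
print(EXPECTED_INTERVALS, kept)
```

The intersection step corresponds to the requirement of synchronous multipoint data, and the `any(...)` test corresponds to the extension that an interval is dropped when a voltage event is flagged at any single point.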

Correlation between PQ Parameters and Global Indices
The first element of the investigation was to compare the correlation between single PQ parameters and their global indicators. The global indicators correspond to the mean value of the three phases, standardized to the limit value in accordance with standard EN 50160 [48]. Generally, the correlation between them should be strong. The results of those correlations are presented in Table 2.
• The correlation for the voltage indicator, harmonic indicator, and maximal harmonic indicator was higher than 0.9.
• The correlation for the asymmetry (unbalance) indicator was equal to 1 because the indicator is just a standardization to the limit value.
• Attention was therefore paid to the correlation of the flicker indicator and the short-term flicker severity for phase L2 at the LV_H&E point, which is presented in Figure 3. As can be observed, the relation between them has two trends. Therefore, the data should be divided into two groups to exclude outliers, as a step toward generalizing the results of the correlation assessment between different measurement points.

Cluster Analysis to Detect Short-Term Working Conditions
In the result presented in the previous subsection, the correlation shows two trends. To separate them, CA with the k-means algorithm was applied. The Chebyshev distance was selected as the distance measure to maximize the differences between the obtained groups (clusters). The input to CA was the three phase values for LV_H&E, so as to include general information about the flicker issue in all phases. The classification was performed with the final number of clusters equal to 2. The number of 10-min values in each cluster is as follows:
• Cluster 1: 24,409;
• Cluster 2: 203.
Then, for each cluster separately, correlation analysis was performed. The results are presented in Table 3. Additionally, the scatter plot of the flicker indicator and the short-term flicker severity for phase L2 at the LV_H&E point is presented in Figure 4. Therefore, in the rest of the investigation, only data from cluster 1 were used, so that the results reflect the general working conditions and are not affected by short-term specific circumstances.
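The per-cluster step can be sketched as below: the series are split by cluster label, the correlation is computed within each cluster, and only the majority cluster is kept for further analysis. The helper names and the toy numbers are hypothetical and do not reproduce the values of Table 3; the cluster labels are assumed to come from a prior clustering step.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def split_by_cluster(labels, *series):
    """Group parallel series by cluster label: {label: [series_1, series_2, ...]}."""
    groups = {}
    for i, lab in enumerate(labels):
        bucket = groups.setdefault(lab, [[] for _ in series])
        for values, s in zip(bucket, series):
            values.append(s[i])
    return groups


def majority_label(labels):
    """Label of the largest cluster, taken here as the general working conditions."""
    return max(set(labels), key=labels.count)


# Toy data: cluster labels, a flicker indicator, and one phase value per interval.
labels = [0, 0, 0, 0, 1, 1, 1]
indicator = [0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
pst_l2 = [0.21, 0.29, 0.41, 0.50, 0.25, 0.24, 0.26]

groups = split_by_cluster(labels, indicator, pst_l2)
per_cluster_r = {lab: pearson(*pair) for lab, pair in groups.items()}
general = groups[majority_label(labels)]   # series kept for further analysis
print(per_cluster_r)
```

Treating the largest cluster as the general working conditions is an assumption of this sketch; in the study the choice of cluster 1 follows from the cluster sizes reported above.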

Correlation Analysis of Units in VPP Using Global Indicators: Pearson Coefficient
The next element of the investigation was to check the PQ correlation between the different points of the VPP. The analysis was performed only for the data in cluster 1. The global indicators of each measurement point were correlated with those of the other points. The results for the Pearson coefficient are presented in Table 4. In the table, the green highlight is used when the correlation strength was at least high. Additionally, the correlation results with a p-value higher than 0.05 are marked with *. The indicated observations are as follows: [...]
Table 4. Correlation using Pearson's coefficient between global indicators for the investigated measurement points in the VPP (column groups include LV_H&E and LV_L, each with indicators 1 to 6).
Where: 1: I_U; 2: I_∆U; 3: I_Pst; 4: I_ku2; 5: I_THDu; 6: I_THDumax. Additionally, the green highlight is used for correlations that were at least high, and light green for noticeable correlations. * There is insufficient evidence to conclude that there is a significant relationship because the correlation coefficient is not significantly different from 0: p-value > 0.05.
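A table of this kind can be assembled by correlating every pair of (measurement point, indicator) series. The sketch below assumes a hypothetical data layout, a dictionary keyed by (point, indicator) with equally long lists of 10-min values, and uses toy numbers that are unrelated to the results of the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def correlation_table(data):
    """Correlate every pair of series.

    data maps (point, indicator) to a list of 10-min values; the result maps
    each pair of keys (in sorted order) to the Pearson coefficient.
    """
    table = {}
    for key_a, key_b in combinations(sorted(data), 2):
        table[(key_a, key_b)] = pearson(data[key_a], data[key_b])
    return table


# Toy series for two measurement points.
toy = {
    ("MV_L", "I_U"): [0.1, 0.2, 0.3, 0.4],
    ("LV_L", "I_U"): [0.2, 0.4, 0.6, 0.8],
    ("LV_L", "I_Pst"): [0.3, 0.1, 0.4, 0.2],
}
table = correlation_table(toy)
for pair, r in table.items():
    print(pair, round(r, 3))
```

The same loop works for the Spearman rank coefficient by swapping the correlation function, which keeps the two tables directly comparable entry by entry.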

Correlation Analysis of Units in VPP Using Global Indicators: Pearson vs. Spearman Rank Coefficient
Another element of the investigation was to compare the two correlation coefficients, Pearson and Spearman rank. The properties of the Spearman rank coefficient are comparable to those of the Pearson coefficient except for one particular feature: the Pearson coefficient requires a linear relationship between the variables, while the Spearman rank coefficient captures any monotonic relationship, including a nonlinear one. The correlation results using the Spearman rank coefficient are presented in Table 5. The observations from the comparison are as follows:
• The Pearson and Spearman rank correlations indicated the same relations with at least a high correlation level for the voltage and voltage envelope indicators. The same was observed for the harmonic and maximal harmonic indicators.
• The Spearman rank coefficients indicated a decrease in the correlation between the voltage envelope and flicker indicators. The correlation was no longer high, but still noticeable.
• The Spearman rank correlation indicated a higher number of noticeable correlations between global indicators. The change is observed mainly for flicker indicators between measurement points.
• In general, the Spearman rank coefficient gave higher values than the Pearson coefficient for indicator pairs whose correlation was below the high level, and lower values for pairs that reached at least a strong correlation level in the Pearson assessment.
Table 5. Where: 1: I_U; 2: I_∆U; 3: I_Pst; 4: I_ku2; 5: I_THDu; 6: I_THDumax. Additionally, the green highlight is used for correlations that were at least high, and light green for noticeable correlations. * There is insufficient evidence to conclude that there is a significant relationship because the correlation coefficient is not significantly different from 0: p-value > 0.05.

Discussion
This article studied a virtual power plant that operates in Lower Silesia in Poland. The investigation was based on synchronous measurements from five PQ recorders located at different points of the VPP area. The power quality measurements were taken at medium and low voltage levels. The PQ measurement lasted 182 days (26 weeks) in 2020. Thus, this measurement represents long-term data from different related points.
The research also concerned the application of global values in place of classical PQ parameters. The global indicators used give one value that represents the three phase values as their mean. The proposed extension to classic approaches such as those in [66,67] or [68] was based on the application of 200-millisecond extreme values taken from each 10-min aggregation interval for voltage and total harmonic distortion in voltage. Additionally, the indicators were standardized to the limit values of the selected PQ standard, EN 50160. However, it is worth noting that other limits, based on the specification of the measured object, may be applied.
The investigation concerned the correlation analysis of global indicators and PQ parameters. Generally, the correlation between them should be strong, due to the mathematical relation between them. The exception should be the envelope indicator because it concerns both minimal and maximal values. The results revealed a distinct behavior for one of the indicators: the flicker indicator for one of the low-voltage loads. Thus, it was confirmed that specific conditions occurred, and these kinds of "outliers" may affect the general results of correlation between measurement points.
To solve the problem with the outliers, the cluster analysis approach was proposed. Clustering with the k-means algorithm was applied with Chebyshev distance. The Chebyshev distance was selected because it emphasizes the largest differences between data. The application of k-means clustering made it possible to detect the outliers and exclude them from further investigation.
For the cleaned PQ dataset, the correlation of PQ global indicators between different measurement points was determined. Firstly, the Pearson coefficient was selected in order to define the linear correlation. At least a high correlation was generally indicated for the voltage indicator of each measurement point. A correlation between harmonic levels (both harmonic and maximal harmonic indicators) was observed between all measurement points. Additionally, a correlation between the voltage envelope and flicker indicators was observed for each point separately. The article also considered the Spearman rank coefficient in order to verify whether the correlation between parameters has a nonlinear nature. For the voltage, voltage envelope, harmonic, and maximal harmonic indicators, the Spearman rank coefficient gave similar results. Generally, the Spearman rank coefficient indicated a noticeable correlation for the flicker indicator between measurement points, which was not detected by the Pearson coefficient. Finally, the general observation is that the Spearman rank correlation indicated a higher number of noticeable correlations between global indicators than the Pearson coefficient. This may be caused by the nonlinear nature of their correlation.
The proposed methodology was verified on the basis of virtual power plant data. However, it may also be applied to other objects where there is a need to define the relationship between units in terms of power quality.

Conclusions
The article proposed a combined approach to obtain information from PQ data that come from a real virtual power plant. The proposed methodology is based on correlation analysis, cluster analysis, and global values. The proposed approach reduced the amount of analyzed data by applying global indicators while maintaining the main features of classic parameters. Additionally, the indicators were extended with 200-millisecond extreme values to make the comparison more sensitive. The proposed cluster analysis excluded outliers that affect correlation results, which made it possible to draw general inferences. Correlation analysis was performed using the Pearson coefficient, in order to assess linear correlations, and the Spearman rank coefficient, in order to consider nonlinear relationships. In conclusion, the assessment of correlations plays a vital role because it is the foundation for various modeling techniques. Further, for values, as well as objects, that are highly correlated, it is possible to predict one variable or object based on another.