3.1. Introduction of Correlation Coefficient
The correlation coefficient matrix R is defined as the matrix formed by the correlation coefficients between each parameter vector and all other vectors in the dataset, i.e.,

$$R_{ij} = \frac{\operatorname{cov}(t_i, t_j)}{\sqrt{D(t_i)\,D(t_j)}} \tag{4}$$

where, in Equation (4), $R_{ij}$ is the correlation coefficient of the transformer oil characteristic gas vectors $t_i$ and $t_j$; $\operatorname{cov}(t_i, t_j)$ is the covariance of $t_i$ and $t_j$; and $D(t_i)$ and $D(t_j)$ are the variances of $t_i$ and $t_j$, respectively.
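As a minimal sketch of this definition, the correlation coefficient matrix can be computed directly from a list of gas vectors. The function names `pearson` and `correlation_matrix`, the gas ordering, and the sample values are illustrative assumptions and are not taken from the paper's dataset.

```python
import math

def pearson(u, v):
    """Correlation coefficient cov(u, v) / sqrt(D(u) * D(v)).

    The 1/n factors of the covariance and the variances cancel,
    so plain sums of deviations are used.
    """
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)  # undefined for constant vectors

def correlation_matrix(vectors):
    """Matrix R with R[i][j] = correlation of vectors[i] and vectors[j]."""
    return [[pearson(u, v) for v in vectors] for u in vectors]

# Hypothetical relative gas contents in %, ordered (H2, CH4, C2H6, C2H4, C2H2).
samples = [
    [10.0, 40.0, 30.0, 19.0, 1.0],  # thermal-like pattern
    [12.0, 38.0, 31.0, 18.0, 1.0],  # similar thermal-like pattern
    [85.0, 8.0, 3.0, 3.0, 1.0],     # hydrogen-dominated pattern
]
R = correlation_matrix(samples)
```

With these made-up vectors, the two similar patterns correlate strongly, while the hydrogen-dominated pattern correlates negatively with both.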
The classification of faults in IEC Publication 60599 follows the main types of faults that can be reliably identified by inspection of the equipment after the fault has occurred in service [27]:
Partial discharge (PD): under the electric field, partial discharge is triggered in areas of the transformer insulation system with weak insulation performance [28]. Partial discharge is an electric discharge that only partially bridges the insulation between conductors. Corona-type partial discharge is evidenced by the formation of X-wax [29].
Discharges of low energy (D1): due to sparking, treeing and tracking evidenced by significant paper punctures, carbonization of paper surface plus carbon particles in oil.
Discharges of high energy (D2): with power follow through, evidenced by extensive carbonization, metal fusion, and possible tripping of the equipment.
Thermal faults below 300 °C (T1): evidenced by paper that has turned brownish.
Thermal faults above 300 °C but below 700 °C (T2): evidenced by carbonized paper.
Thermal faults above 700 °C (T3): evidenced by oil carbonization, metal coloration, or fusion.
For power transformers, the causes of failure include the soundness of the structural design, the quality of the insulation and, more importantly, the various stresses the transformer must bear in service. These stresses include all kinds of over-voltage and overheating, that is, electrical stress and thermal stress. The DGA data of a transformer provide information on the electrical and thermal stresses in oil-immersed power transformers [30].
Oil and paper, both insulation materials of the transformer, decompose under electrical and thermal stresses. These two stresses can lead to the breakdown of the insulating materials and the release of gaseous decomposition products [31]. Griffin [32] described the types of faults associated with these gases. The decomposition of transformer oil produces hydrogen (H₂), methane (CH₄), acetylene (C₂H₂), ethylene (C₂H₄) and ethane (C₂H₆). CH₄ and C₂H₆ are related to low-temperature oil breakdown, C₂H₄ is related to high-temperature oil breakdown, C₂H₂ is related to discharge, and H₂ is related to partial discharge.
To illustrate that the DGA gas contents of transformers with the same fault are correlated, and to explain the problems that arise when clustering directly on correlation, 12 DGA records were selected from the typical application cases of power grid equipment state detection technology prepared by the operation and maintenance department of State Grid Corporation of China [33]. The fault types are T1, T2, T3, PD, D1 and D2, with two records per fault type. First, the percentage of hydrogen and of each hydrocarbon gas in the total gas is calculated for the 12 records, converting the absolute gas content into relative content; then the correlation analysis is carried out according to Equation (4); finally, the correlation coefficient matrix R is obtained. The line chart of the DGA gas content of the 12 faulty transformers is shown in Figure 1.
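The conversion from absolute to relative gas content described above can be sketched as follows. The helper name `to_percent`, the gas ordering and the concentrations are hypothetical and serve only to show the arithmetic.

```python
def to_percent(absolute):
    """Convert absolute gas contents (e.g. in ppm) to percentages of the total."""
    total = sum(absolute)
    if total == 0:
        raise ValueError("total gas content is zero; percentages are undefined")
    return [100.0 * x / total for x in absolute]

# Hypothetical record ordered (H2, CH4, C2H6, C2H4, C2H2), total 400 ppm.
relative = to_percent([50.0, 120.0, 60.0, 150.0, 20.0])
# relative == [12.5, 30.0, 15.0, 37.5, 5.0]
```

Because the Pearson coefficient is unchanged when a whole vector is multiplied by a positive constant, this normalization puts records on a common scale without altering their pairwise correlations.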
Calculate the correlation according to Equation (4).
It can be seen that DGA data of the same fault type are highly correlated, with a correlation coefficient generally greater than 0.95. However, the correlation among the last six records in the matrix, which cover several fault types, is also relatively high. For example, the correlation between the seventh and eighth records, which share a fault type, is 0.99, but the correlation coefficient between the seventh record and the ninth to twelfth records is also as high as 0.92. This low differentiation makes clustering difficult. The reason is that the seventh and eighth records belong to PD, whose characteristic gas is hydrogen, while the ninth to twelfth records belong to D1 and D2, whose characteristic gas is acetylene. Because the acetylene content is low, the correlation coefficient calculated directly from the data does not bring out the effect of the acetylene content against the much larger variation in hydrogen content; that is, direct correlation analysis of the original data submerges the sample attribute information. Therefore, the following study was carried out in this paper: the percentages of ethylene and acetylene in the seventh to tenth records are multiplied by an amplification factor, and the correlation analysis is then repeated.
The line chart of the DGA gas content obtained with an amplification factor of 1.35 is shown in Figure 2.
The correlation coefficient matrix of data 7 to 10 is:
It can be seen that when the acetylene content is multiplied by 1.35, the correlation between gases of the same fault category remains high, while the correlation between gases of different fault types decreases. For example, the correlation coefficient between the seventh and tenth records drops from 0.92 to 0.88, which makes the data of different faults easier to tell apart.
Finding an appropriate amplification coefficient is therefore the next step of the research, so that the characteristic gas of each kind of fault is better reflected in the correlation coefficient.
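The effect of the amplification factor can be illustrated with a small sketch. The three vectors below are invented for illustration (they are not records 7 to 10 of the paper), the gas ordering is an assumption, and `amplify` is a hypothetical helper; only the factor 1.35 comes from the text.

```python
import math

def pearson(u, v):
    """Correlation coefficient cov(u, v) / sqrt(D(u) * D(v))."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def amplify(vector, factors):
    """Multiply each gas content by its own amplification factor."""
    return [x * k for x, k in zip(vector, factors)]

# Hypothetical relative contents in %, ordered (H2, CH4, C2H6, C2H4, C2H2).
pd_a = [85.0, 8.0, 3.0, 3.0, 1.0]     # PD-like: hydrogen dominates
pd_b = [83.0, 9.0, 4.0, 3.0, 1.0]     # second PD-like record
d1_a = [60.0, 10.0, 3.0, 12.0, 15.0]  # D1-like: noticeable acetylene

# Amplify ethylene and acetylene by 1.35, leave the other gases unchanged.
factors = [1.0, 1.0, 1.0, 1.35, 1.35]

between_before = pearson(pd_a, d1_a)
between_after = pearson(amplify(pd_a, factors), amplify(d1_a, factors))
within_after = pearson(amplify(pd_a, factors), amplify(pd_b, factors))
```

For these made-up vectors the between-class correlation falls after amplification while the within-class correlation stays close to 1, which mirrors the behaviour reported for the real records.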
3.2. Chaotic Sequence Optimization
Different gases usually play different roles in sample classification. Simulation tests and a large number of field tests show [3] that acetylene is the characteristic gas of discharges, hydrogen is the characteristic gas of partial discharges, and ethylene is the characteristic gas of thermal faults. However, the correlation between two groups of gases calculated according to Equation (4) cannot reflect how similar the characteristic gases of different faults are, so it is necessary to amplify the characteristic gases that are present in smaller amounts. Therefore, the following method for amplifying the characteristic gases of faults is proposed in this paper.
Suppose that the DGA data of n fault transformers are collected and grouped into typical fault datasets indexed by c: c = 1 is the T1 dataset, c = 2 is the T2 dataset, c = 3 is the T3 dataset, c = 4 is the PD dataset, c = 5 is the D1 dataset, and c = 6 is the D2 dataset.
When
,
where
represents the first DGA train vector of T1.
The matrix of amplification coefficient
is defined, and the amplification matrix
is obtained by multiplying the matrix
by the matrix
.
where the row correlation coefficient
of
is
where
is the correlation coefficient matrix of type
i and type
j faults
where
is the correlation coefficient between the m-th DGA vector of class
i and the m-th DGA vector of class
j.
The PC (partition coefficient) can be defined by the above description
The AC (aggregation coefficient) is
A larger partition coefficient between class i and class j faults indicates a more obvious separation. When i ≠ j, a larger partition coefficient indicates lower similarity between class i and class j faults. When i = j, a larger partition coefficient indicates higher data correlation within class i faults.
The idea of this section is to use a chaotic sequence to optimize the amplification coefficient matrix, so that the partition coefficient between different types of faults and the aggregation coefficient within the same type of fault are both as large as possible, thereby improving the classification and diagnosis performance of CCDBSCAN.
Chaotic sequence optimization searches the chaotic variables within a given range by exploiting the ergodicity and regularity of chaotic sequences, so that the search can escape local optima and finally reach the global optimum [34]. Based on the idea of sharing information among orbits, the next iteration is determined not only by the inertia weight but also by the historical information of the orbit itself and the global historical information of the other orbits. Thanks to the chaotic sequence, the feasible region is searched thoroughly and the global optimization ability of the search method is clearly improved.
In this paper, the logistic model is used to generate chaotic sequences:

$$x_{k+1} = \mu x_k (1 - x_k) \tag{10}$$

where $\mu$ is the control variable. When $\mu = 4$, the system is in a completely chaotic state, and the sequence generated by Equation (10) is a chaotic sequence.
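A minimal sketch of such a generator is given here, assuming the usual form of the logistic map, x_{k+1} = μ·x_k·(1 − x_k), with μ = 4. The function names and the target range [1.0, 2.0] used to map the sequence onto candidate amplification factors are illustrative assumptions.

```python
def logistic_sequence(x0, length, mu=4.0):
    """Iterate x_{k+1} = mu * x_k * (1 - x_k) starting from x0 in (0, 1)."""
    if not 0.0 < x0 < 1.0:
        raise ValueError("x0 must lie strictly between 0 and 1")
    seq = [x0]
    while len(seq) < length:
        seq.append(mu * seq[-1] * (1.0 - seq[-1]))
    return seq

def map_to_range(seq, low, high):
    """Map chaotic values from [0, 1] linearly onto [low, high]."""
    return [low + (high - low) * x for x in seq]

chaos = logistic_sequence(0.3, 5)
# Hypothetical search range for an amplification factor.
candidates = map_to_range(chaos, 1.0, 2.0)
```

Starting values such as 0.25, 0.5 and 0.75 should be avoided, since with μ = 4 they fall onto the fixed points 0.75 or 0 after at most two iterations and the sequence stops wandering.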
The optimization algorithm is a decision-making problem. The objective function of optimization in this paper is the partition coefficient of all kinds of faults, so that when , the separation coefficient is the maximum, when , the separation coefficient is the minimum.
The objective function of optimization is
Step 1: normalize the data and initialize each constant: number of tracks L, variable dimension N, iteration precision ε, and maximum number of iterations Tmax.
Step 2: randomly generate the N-dimensional optimization variables for the L orbits, and calculate the fitness on each orbit from Equation (11).
Step 3: determine and update the minimum historical fitness of each track and record the corresponding sequence; then determine and update the minimum historical fitness over all tracks and the sequence corresponding to it.
Step 4: calculate the adjustment amount of the variable, which is composed of three parts: the first part is the inertia component of the sequence; the second part is the vector difference between the value of each track and the historical optimal value of the current track; the third part is the vector difference between the value of each track and the historical optimal value of all tracks. Finally, the adjustment amount of the optimization variable is obtained as follows:
In Equation (12),
represents the weight of three parts respectively. In the early stage of the iterative process,
value is enlarged to ensure the diversity of each track, and in the later stage of the iterative process,
value is enlarged to ensure the fusion of each track information. Three weight variables are defined as:
Step 5: use Equation (10) to update the L-group chaotic sequence, map it to the variable value space, and use Equation (11) to calculate its fitness. If the fitness of the chaotic sequence in a certain orbit is better than that after iteration, then update the orbit.
Step 6: judge whether the iteration meets the end condition. If it meets the condition, the optimization is ended. If it does not, return to step 3 to continue the iteration.
At the end of this process, the optimized amplification coefficient matrix will be obtained.
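The six steps can be sketched as a multi-orbit search with a chaotic candidate on each orbit. This is only an illustration under stated assumptions: the objective below is a placeholder and not the paper's Equation (11), the linear weight schedule (`w1`, `w2`, `w3`) is a guess that merely follows the description of early-stage diversity and late-stage fusion, the stopping rule uses the iteration cap only, and all names are hypothetical.

```python
import random

def logistic(x, mu=4.0):
    """One logistic-map step, clamped to [0, 1] against rounding drift."""
    return min(1.0, max(0.0, mu * x * (1.0 - x)))

def chaos_optimize(fitness, dim, low, high, orbits=8, t_max=200, seed=0):
    """Minimize `fitness` over [low, high]^dim with several interacting orbits."""
    rng = random.Random(seed)
    span = high - low
    # Step 2: random start per orbit, zero adjustment, chaotic state per variable.
    x = [[low + span * rng.random() for _ in range(dim)] for _ in range(orbits)]
    v = [[0.0] * dim for _ in range(orbits)]
    z = [[rng.uniform(0.01, 0.99) for _ in range(dim)] for _ in range(orbits)]
    f = [fitness(p) for p in x]
    # Step 3: historical best of each orbit and of all orbits.
    pbest, pbest_f = [p[:] for p in x], f[:]
    g = min(range(orbits), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for t in range(t_max):  # Step 6: here the end condition is the iteration cap
        frac = t / t_max
        w1 = 0.9 - 0.5 * frac   # inertia weight (assumed schedule)
        w2 = 2.0 - 1.5 * frac   # own-history weight, larger early on
        w3 = 0.5 + 1.5 * frac   # global-history weight, larger later
        for i in range(orbits):
            # Step 4: adjustment = inertia + own-best difference + global-best difference.
            for d in range(dim):
                v[i][d] = (w1 * v[i][d]
                           + w2 * (pbest[i][d] - x[i][d])
                           + w3 * (gbest[d] - x[i][d]))
                x[i][d] = min(high, max(low, x[i][d] + v[i][d]))
            f[i] = fitness(x[i])
            # Step 5: advance the chaotic sequence, map it to the search range,
            # and replace the orbit if the chaotic candidate is better.
            z[i] = [logistic(c) for c in z[i]]
            cand = [low + span * c for c in z[i]]
            fc = fitness(cand)
            if fc < f[i]:
                x[i], f[i] = cand, fc
            if f[i] < pbest_f[i]:
                pbest[i], pbest_f[i] = x[i][:], f[i]
            if f[i] < gbest_f:
                gbest, gbest_f = x[i][:], f[i]
    return gbest, gbest_f

def placeholder_objective(p):
    """Stand-in objective with its minimum at (1, -2); not Equation (11)."""
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

best, best_f = chaos_optimize(placeholder_objective, dim=2, low=-5.0, high=5.0)
```

In the setting of this section, the optimization variable would hold the entries of the amplification coefficient matrix and the placeholder would be replaced by the objective built from the partition and aggregation coefficients.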
3.3. Improved CCDBSCAN Diagnosis Method
The original DBSCAN forms clusters from the Euclidean distance between points, so that the data density measured by Euclidean distance drives the clustering. However, because data of the same fault type are highly correlated, clustering based on correlation density is more reasonable and accurate than clustering based on Euclidean distance. The specific diagnosis method is as follows:
Definition 1 (Eps neighborhood). The Eps neighborhood of an object p is the region centered on p in which the correlation coefficient with p is not less than Eps:

$$N_{Eps}(p) = \{\, q \in D \mid r(p, q) \ge Eps \,\}$$

where D is the dataset and r(p, q) is the correlation coefficient between objects p and q; that is, the neighborhood contains all objects in dataset D whose correlation coefficient with object p is not less than Eps.
Other definitions remain unchanged, and the flow chart of the CCDBSCAN clustering algorithm is shown in Figure 3.
DBSCAN is used to process the DGA data in an unsupervised manner: the correlation and dispersion between the DGA sample vectors are analyzed, and the samples are classified according to their correlation, so that DGA samples in the same fault set are as similar as possible and samples in different fault sets are as different as possible. This method is called CCDBSCAN.
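A minimal sketch of correlation-based DBSCAN is shown here, assuming a precomputed correlation coefficient matrix and the standard DBSCAN notions of core point and MinPts (the text states that the other definitions are unchanged). The function name `cc_dbscan`, the parameter values and the matrix entries are hypothetical.

```python
def cc_dbscan(R, eps, min_pts):
    """Cluster objects from a correlation matrix R.

    The Eps neighborhood of p is every q with R[p][q] >= eps (p included).
    Returns one label per object: 0, 1, ... for clusters and -1 for noise.
    """
    n = len(R)
    neighbors = [[q for q in range(n) if R[p][q] >= eps] for p in range(n)]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        if len(neighbors[p]) < min_pts:
            labels[p] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[p] = cluster
        frontier = list(neighbors[p])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster  # former noise becomes a border point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            if len(neighbors[q]) >= min_pts:  # q is a core point: keep expanding
                frontier.extend(neighbors[q])
    return labels

# Hypothetical correlation matrix: objects 0-1 and 2-3 form two tight pairs,
# object 4 is not strongly correlated with anything else.
R = [
    [1.00, 0.99, 0.40, 0.35, 0.10],
    [0.99, 1.00, 0.42, 0.38, 0.12],
    [0.40, 0.42, 1.00, 0.97, 0.20],
    [0.35, 0.38, 0.97, 1.00, 0.15],
    [0.10, 0.12, 0.20, 0.15, 1.00],
]
labels = cc_dbscan(R, eps=0.95, min_pts=2)
# labels == [0, 0, 1, 1, -1]
```

The only change from distance-based DBSCAN is the direction of the comparison: a larger correlation means closer, so neighbors are selected with `>= eps` rather than `<= eps`.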