Power Transformer Fault Diagnosis Based on Dissolved Gas Analysis by Correlation Coefficient-DBSCAN

Abstract: Transformers work in a complex environment, which makes them prone to failure. Dissolved gas analysis (DGA) is one of the most important methods for diagnosing internal insulation faults in oil-immersed transformers. In view of the high correlation among DGA data of the same fault type, this paper proposes a new transformer fault diagnosis method based on correlation coefficient density clustering, which applies density clustering to the correlation coefficients of DGA data. First, we calculate the correlation coefficient of the dissolved gas content in the oil of faulty transformers; next, we enlarge the correlation within the same fault category by introducing an amplification coefficient; finally, we use the density clustering method for cluster diagnosis. The experimental results show that the clustering accuracy is improved by 32.7% compared with direct clustering without the correlation coefficient, and that the method can effectively cluster different types of transformer fault modes. This method provides a new idea for transformer fault identification and has practical application value.


Introduction
The health of power transformers is very important for the stable operation of power grid. There are a large number of transformers in service, most of which have been put into use for a certain period of time, and there will be internal faults in the long-term aging process. It is of great significance to diagnose all kinds of latent faults in transformers accurately for the stable and normal operation of transformers. Dissolved gas analysis (DGA) is widely used in online diagnosis of oil-immersed transformers because it uses non-electric quantity as a reference and is not affected by electromagnetism [1,2]. The diagnosis process of DGA is generally divided into the extraction of transformer oil samples, the stripping of dissolved gas in oil, the measurement of gas components and the determination of fault category. The determination of fault category is the core process of DGA diagnosis. The central idea is to determine the fault category by the content of gas component in oil [3].
Fault diagnosis technology can be divided into traditional chart query methods and modern intelligent algorithm identification methods [4,5]. Traditional chart query methods include the three-ratio method [6,7], the Duval triangle method [8] and the Pentagon method [9]. These methods are convenient to use and need neither programming nor complex calculation: once the ratios of the gas components have been calculated, the corresponding fault category can be looked up in a table or graph. However, they suffer from low accuracy and overly absolute judgments [10,11]. An intelligent algorithm is a diagnosis method based on pattern recognition by a computer. Fault recognition algorithms mainly include classification and clustering algorithms [12][13][14]. Classification algorithms include the support vector machine [15], decision tree and neural network [16], among others. These methods have achieved good results in transformer fault diagnosis. Clustering algorithms such as k-means [17] and fuzzy clustering [18,19] can accurately cluster transformer fault data and identify the type of transformer fault. However, they have some shortcomings, such as the need to determine the number of clusters in advance, the difficulty of building a reasonable membership function, sensitivity to deviation points, and a more complex calculation process [20]. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is a density-based clustering algorithm [21,22]. It does not need the number of clusters to be determined in advance, divides high-density regions into clusters, and effectively filters out low-density regions. It can find clusters of arbitrary shape in a noisy dataset, and it is widely used in detection and diagnosis.
Hou [23] employed DBSCAN to detect textural damage, Li [24] accomplished thermal runaway diagnosis of battery system for electric vehicles by DBSCAN, and Li [25] made the combination of a DBSCAN and symmetrized dot pattern to complete fault diagnosis of rolling bearing. For transformer fault diagnosis, the density difference of each fault category in the Euclidean distance space of DGA data is not obvious, and direct application of DBSCAN to DGA diagnosis is not effective.
In the practical application of DBSCAN to DGA data processing, the variety of transformer faults and the weak separation of their Euclidean distances make the classification results sensitive to the clustering data and the accuracy of fault identification low, so the clustering effect is unsatisfactory and it is difficult to obtain a data classification in line with engineering practice. In view of these problems, this paper proposes a transformer fault diagnosis method called Correlation Coefficient's Density-Based Spatial Clustering of Applications with Noise (CCDBSCAN), which applies a correlation coefficient to DBSCAN, constructs a partition coefficient characterized by the correlation coefficient, and enlarges the fault characteristics of dissolved gas data in oil with an amplification coefficient, thereby realizing the application of DBSCAN to DGA.
The remaining sections of this paper are organized as follows: Section 2 gives an overview of definition and clustering principle of DBSCAN. Section 3 introduces the correlation coefficient through an analysis of the defects of the traditional DBSCAN and proposes CCDBSCAN. In Section 4, the proposed method is used to cluster and diagnose the DGA data, and the effectiveness and advantages of this method are compared and analyzed. Some conclusions are presented in Section 5.

DBSCAN Method
The DBSCAN algorithm was proposed by Martin Ester, Hans-Peter Kriegel and others in 1996 [26]. It is a density-based spatial clustering algorithm. To describe the algorithm accurately, the following definitions are given first.
Definition 1 (Eps neighborhood). Given dataset D, the Eps neighborhood of an object p is the region centered on p with radius Eps, that is,
$$N_{Eps}(p) = \{\, q \in D \mid Dist(p, q) \le Eps \,\} \quad (1)$$
where D is the dataset and Dist(p, q) is the distance between objects p and q. $N_{Eps}(p)$ therefore contains all objects in D that are no more than Eps away from p.
Definition 2 (core object). Given dataset D, a core object is an object p ∈ D whose Eps neighborhood contains at least MinPts objects, where MinPts is the neighborhood density threshold. A core object p satisfies Equation (2):
$$|N_{Eps}(p)| \ge MinPts \quad (2)$$
Definition 3 (direct density reachability). Given dataset D, if object q is in the Eps neighborhood of object p and p is a core object, then q is directly density reachable from p; that is, $q \in N_{Eps}(p)$ and p satisfies Equation (2).
Definition 4 (density reachability). Given dataset D, if there is a chain of objects $p_1, p_2, \ldots, p_n \in D$ such that $p_{i+1}$ is directly density reachable from $p_i$ for every i (0 < i < n), then $p_n$ is density reachable from $p_1$.
Definition 5 (density connection). Given dataset D, if there is an object o ∈ D such that objects p and q are both density reachable from o, then p and q are density connected.
Definition 6 (cluster). Given dataset D, a cluster C is a non-empty subset of D that meets the following conditions:
(1) For any object q, if the core object p ∈ C and q is density reachable from p, then q ∈ C.
(2) Any two objects p, q ∈ C are density connected.
Definition 7 (noise). Given dataset D, a noise point is an object that does not belong to any cluster.
DBSCAN first performs a region query on an arbitrary object o to compute its Eps neighborhood. If the neighborhood contains fewer than MinPts objects, o is temporarily marked as noise; otherwise, o and its neighbors are marked as belonging to cluster $C_1$, and the region query is repeated for each neighbor of o. When cluster $C_1$ cannot be expanded further, an unmarked object is selected and a new cluster $C_2$ is generated from it. The process is repeated for each cluster $C_i$ until all objects are marked.
The essence of the density-based clustering algorithm is to find the high-density dataset in the dataset, that is, the average distance between data points in the dataset is small, while there are low-density areas between high-density datasets. The DBSCAN algorithm uses the parameters of Eps and MinPts to determine the threshold of dividing high-density datasets.
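The clustering procedure described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in this paper (which was written in MATLAB); the function names `region_query` and `dbscan` are hypothetical, Euclidean distance is assumed for Dist(p, q), and a core object is assumed to need at least MinPts objects (itself included) in its Eps neighborhood.

```python
import math

NOISE = -1  # label for objects that belong to no cluster (Definition 7)


def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (Definition 1)."""
    return [j for j, q in enumerate(points)
            if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # provisional: may later become a border point
            continue
        cluster += 1           # i is a core object, so it starts a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also a core object
                queue.extend(j_neighbors)
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
labels = dbscan(points, eps=0.5, min_pts=3)
```

With these illustrative points, the two dense groups become two clusters and the isolated point is left as noise, without the number of clusters being specified in advance.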

Introduction of Correlation Coefficient
The correlation coefficient matrix R is defined as the matrix formed by the correlation coefficients between each parameter vector and all other vectors in the dataset, i.e., for a dataset of n vectors,
$$R = \begin{bmatrix} r_{11} & \cdots & r_{1n} \\ \vdots & \ddots & \vdots \\ r_{n1} & \cdots & r_{nn} \end{bmatrix} \quad (3)$$
$$r_{ij} = \frac{\mathrm{cov}(t_i, t_j)}{\sqrt{D(t_i)}\,\sqrt{D(t_j)}} \quad (4)$$
where, in Equation (4), $r_{ij}$ is the correlation coefficient of the transformer oil characteristic gas vectors $t_i$ and $t_j$; $\mathrm{cov}(t_i, t_j)$ is the covariance of $t_i$ and $t_j$; and $D(t_i)$ and $D(t_j)$ are the variances of $t_i$ and $t_j$, respectively.
IEC Publication 60599 classifies faults according to the main types that can be reliably identified by inspection of the equipment after the fault has occurred in service [27]:
Partial discharge (PD): under the electric field, partial discharge is triggered in areas of the transformer insulation system with weak insulation performance [28]. Partial discharge is an electric discharge that only partially bridges the insulation between conductors. Corona partial discharge is evidenced by the formation of x-wax [29].
Discharges of low energy (D1): due to sparking, treeing and tracking, evidenced by significant paper punctures, carbonization of the paper surface and carbon particles in oil.
Discharges of high energy (D2): with power follow-through, evidenced by extensive carbonization, metal fusion, and possible tripping of the equipment.
Thermal faults below 300 °C (T1): evidenced by brownish paper.
Thermal faults above 300 °C but below 700 °C (T2): evidenced by carbonized paper.
Thermal faults above 700 °C (T3): evidenced by oil carbonization, metal coloration, or fusion.
For power transformers, the causes of failure include the soundness of the structural design, the quality of the insulation and, more importantly, the various stresses the transformer has to bear in operation. These stresses include all kinds of over-voltage and overheating, that is, electrical stress and thermal stress. The DGA data of transformers provide information on the electrical and thermal stress of oil-immersed power transformers [30].
Oil and paper, both of which are insulation materials of the transformer, are decomposed by electrical and thermal stresses. These two stresses can lead to the breakdown of insulating materials and the release of gaseous decomposition products [31]. Griffin [32] described the types of faults associated with these gases. The decomposition of transformer oil produces hydrogen (H2), methane (CH4), acetylene (C2H2), ethylene (C2H4) and ethane (C2H6). CH4 and C2H6 are related to low-temperature oil breakdown, C2H4 is related to high-temperature oil breakdown, C2H2 is related to discharge, and H2 is related to partial discharge.
To illustrate the correlation of DGA gas content among transformers with the same fault, and to explain the problems that arise when clustering directly on this correlation, 12 pieces of DGA data were selected from the typical application cases of power grid equipment state detection technology prepared by the operation and maintenance department of State Grid Corporation of China [33]. The fault types are T1, T2, T3, PD, D1 and D2, with two samples of each. First, the percentage content of hydrogen and the various hydrocarbon gases in the total gas is calculated for the 12 samples, converting the absolute gas content into relative content; then the correlation analysis is carried out according to Equation (4); finally, the correlation coefficient matrix R is obtained. The line chart of the DGA gas content of the 12 fault transformers is shown in Figure 1.
The correlation is calculated according to Equation (4), with the 12 samples ordered T1, T1, T2, T2, T3, T3, PD, PD, D1, D1, D2, D2. DGA data of the same fault are indeed highly correlated, with a correlation coefficient generally greater than 0.95. However, the last six samples in the matrix are also highly correlated with one another. For example, the correlation between the seventh and eighth data, which share a fault type, is 0.99, but the correlation coefficient between the seventh data and the ninth to twelfth data is also as high as 0.92. This low differentiation makes clustering difficult. The reason is that the seventh and eighth data belong to PD, whose characteristic gas is hydrogen, while the ninth to twelfth data belong to D1 and D2, whose characteristic gas is acetylene. Because the acetylene content is low, a correlation coefficient calculated directly from the data is dominated by the much larger hydrogen content and does not reflect the influence of acetylene; that is, direct correlation analysis of the original data buries the attribute information of the samples. Therefore, the following study is carried out in this paper: the percentages of ethylene and acetylene content in the seventh to tenth data are multiplied by the amplification factors α1 and α2, respectively, and the correlation analysis is then repeated.
The line chart of DGA gas content obtained when α 1 = 1, α 2 = 1.35 is shown in Figure 2.
From the correlation coefficient matrix of data 7 to 10, it can be seen that when the acetylene content is multiplied by 1.35, the correlation within the same fault category remains high, while the correlation between different fault categories becomes smaller. For example, the correlation coefficient of the seventh and the tenth data is reduced from 0.92 to 0.88, which makes the gas data of different faults easier to tell apart.
Finding an appropriate amplification coefficient is therefore the next step of the research, so that the characteristic gas of each kind of fault is better reflected in the correlation coefficient.
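The effect of an amplification factor on the correlation coefficient of Equation (4) can be sketched in a few lines of Python. The two samples below are hypothetical percentage vectors in the order [H2, CH4, C2H6, C2H4, C2H2], chosen only to mimic a hydrogen-dominated sample and a sample with more acetylene; they are not data from [33], and the helper names `pearson` and `amplify` are illustrative.

```python
import math


def pearson(u, v):
    """Correlation coefficient of two equal-length vectors (Equation (4))."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def amplify(sample, factors):
    """Multiply each gas percentage by its amplification factor."""
    return [x * f for x, f in zip(sample, factors)]


# Hypothetical samples: [H2, CH4, C2H6, C2H4, C2H2] in percent.
pd_like = [85.0, 8.0, 4.0, 2.0, 1.0]     # hydrogen dominates
d1_like = [60.0, 10.0, 3.0, 12.0, 15.0]  # noticeable acetylene

# alpha_1 = 1 for ethylene and alpha_2 = 1.35 for acetylene, as in the text.
factors = [1.0, 1.0, 1.0, 1.0, 1.35]

r_before = pearson(pd_like, d1_like)
r_after = pearson(amplify(pd_like, factors), amplify(d1_like, factors))
```

For these illustrative vectors, `r_after` is smaller than `r_before`: scaling the acetylene column lowers the correlation between samples whose acetylene shares differ, which is the behavior the amplification factor is meant to produce.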

Chaotic Sequence Optimization
Different gases usually play different roles in the process of sample classification. Simulation tests and a large number of field tests show that [3] acetylene is the characteristic gas of discharges, hydrogen is the characteristic gas of partial discharges, and ethylene is the characteristic gas of a thermal fault. However, according to Equation (4), the correlation between two groups of gases cannot reflect the similarity of characteristic gases of different faults, so it is necessary to amplify the less characteristic gases. Therefore, the following calculation method for enlarging the characteristic gases of faults is proposed in this paper.
Suppose that the DGA data are arranged as the matrix X (n×5), whose rows are DGA training vectors such as the first DGA training vector of T1. The amplification coefficient matrix A is defined, and the amplified matrix Y (n×5) is obtained by multiplying the matrix X (n×5) by the matrix A:
$$Y = XA$$
The correlation coefficients are then grouped by fault type, where $R_{ij}$ is the correlation coefficient matrix of type i and type j faults, and $r_{ijmm}$ is the correlation coefficient between the m-th DGA vector of class i and the m-th DGA vector of class j. From this description, the partition coefficient (PC) and the aggregation coefficient (AC) are defined. A larger partition coefficient between class i and class j faults indicates a more obvious separation. When i ≠ j, a larger partition coefficient indicates lower similarity between class i and class j faults. When i = j, a higher partition coefficient indicates higher data correlation within class i faults.
The idea of this section is to use a chaotic sequence to optimize the amplification coefficient matrix, so that the partition coefficient between different types of faults is the largest and the aggregation coefficient of the same type of faults is the largest, and thereby to improve the classification and diagnosis effect of CCDBSCAN.
Chaos sequence optimization is to search chaos variables in a certain range according to the ergodicity and regularity of chaos sequence, so as to make the search of chaos variables jump out of the local optimum and finally reach the global optimum [34]. Based on the idea of information sharing of each orbit, the next iteration is not only determined by the inertia weight, but also affected by the historical information of the orbit and the global historical information of the rest orbit. Due to the introduction of chaotic sequence, the global feasible solution is fully searched, and the global optimization ability of the search method is obviously improved.
In this paper, the logistic model is used to generate chaotic sequences:
$$x_{n+1} = \mu x_n (1 - x_n) \quad (10)$$
where µ is the control variable. When µ = 4, the system is in a completely chaotic state, and the sequence generated by Equation (10) is a chaotic sequence. The optimization algorithm is a decision-making problem. The objective function of optimization in this paper is the partition coefficient of all kinds of faults, so that when i ≠ j the separation coefficient is the maximum, and when i = j the separation coefficient is the minimum.
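As a small illustration of the logistic model with µ = 4, the following Python sketch generates a chaotic sequence and maps it onto a search interval. The function names and the interval [1, 2] are assumptions made for the example; the starting value should avoid points such as 0, 0.25, 0.5, 0.75 and 1, from which the map with µ = 4 never produces a chaotic orbit.

```python
def logistic_sequence(x0, length, mu=4.0):
    """Iterate the logistic map x_{n+1} = mu * x_n * (1 - x_n)."""
    seq = [x0]
    for _ in range(length - 1):
        seq.append(mu * seq[-1] * (1.0 - seq[-1]))
    return seq


def map_to_range(z, low, high):
    """Map a chaotic value z in [0, 1] onto the interval [low, high]."""
    return low + z * (high - low)


# 100 chaotic values in [0, 1], mapped to a hypothetical range [1.0, 2.0]
# for one amplification coefficient.
sequence = logistic_sequence(x0=0.3, length=100)
candidates = [map_to_range(z, 1.0, 2.0) for z in sequence]
```

Because every value of the sequence stays in [0, 1], the mapped candidates never leave the chosen interval.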
The objective function of optimization, Equation (11), includes the term $\sum_{k=1}^{6} P_{kk}$.
Step 1: normalize the data and initialize each constant: number of tracks L, variable dimension N, iteration precision ε, and maximum number of iterations $T_{max}$.
Step 2: randomly generate n-dimensional optimization variable of historical fitness of all tracks, and the sequence X g = X i corresponding to F g .
Step 4: calculate the adjustment amount $\Delta X_L$ of the variable, which is composed of three parts: the first part is the inertia component of the sequence; the second part is the vector difference between the value of each track and the historical optimal value of the current track; the third part is the vector difference between the value of each track and the historical optimal value of all tracks. The adjustment amount of the optimization variable is then obtained by Equation (12), in which $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the weights of the three parts, respectively. In the early stage of the iterative process, $\lambda_1$ is enlarged to ensure the diversity of each track; in the later stage, $\lambda_3$ is enlarged to ensure the fusion of the information of all tracks. The three weight variables are defined by Equation (13).
Step 5: use Equation (10) to update the L groups of chaotic sequences, map them to the variable value space, and use Equation (11) to calculate their fitness. If the fitness of the chaotic sequence in a certain orbit is better than that after iteration, then update the orbit.
Step 6: judge whether the iteration meets the end condition. If it meets the condition, the optimization is ended. If it does not, return to step 3 to continue the iteration.
At the end of this process, the optimized amplification coefficient matrix will be obtained.
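The search idea can be illustrated with a deliberately simplified Python sketch. It keeps only the chaotic part of the procedure: several tracks are advanced with the logistic map, mapped into the search range, and the best candidate found so far is remembered. The weighted adjustment of Step 4 is omitted, the toy objective is a stand-in and not the fitness of Equation (11), and all names, the number of tracks and the random seed are assumptions.

```python
import random


def logistic_step(z, mu=4.0):
    """One iteration of the logistic map."""
    return mu * z * (1.0 - z)


def chaotic_search(objective, low, high, n_tracks=20, n_iter=200, seed=0):
    """Maximize `objective` over the box [low, high] by chaotic sampling."""
    rng = random.Random(seed)
    dim = len(low)
    # One chaotic state per track and per dimension.
    states = [[rng.uniform(0.01, 0.99) for _ in range(dim)]
              for _ in range(n_tracks)]
    best_x, best_f = None, float("-inf")
    for _ in range(n_iter):
        for track in states:
            for d in range(dim):
                track[d] = logistic_step(track[d])
            # Map the chaotic state into the variable value space.
            x = [lo + z * (hi - lo) for z, lo, hi in zip(track, low, high)]
            f = objective(x)
            if f > best_f:  # keep the global historical optimum
                best_x, best_f = x, f
    return best_x, best_f


def toy_objective(x):
    """Toy objective with a single peak at 1.35 (illustration only)."""
    return -(x[0] - 1.35) ** 2


best_x, best_f = chaotic_search(toy_objective, low=[1.0], high=[2.0])
```

A fuller version would add the inertia, track-best and global-best terms of Step 4 between chaotic updates; the sketch only shows how the ergodicity of the chaotic sequence spreads candidates over the whole range.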

Improved CCDBSCAN Diagnosis Method
The original DBSCAN forms each cluster by calculating the Euclidean distance between points, so the data density used for clustering is expressed by Euclidean distance. However, because data of the same fault type are highly correlated, clustering based on correlation density rather than Euclidean distance is more reasonable and accurate. The specific diagnosis method is as follows.
Definition 1 (Eps neighborhood). The Eps neighborhood of an object p is the region centered on p in which the correlation coefficient with p is not less than Eps:
$$N_{Eps}(p) = \{\, q \in D \mid r(p, q) \ge Eps \,\} \quad (14)$$
where D is the dataset and r(p, q) is the correlation coefficient between objects p and q. $N_{Eps}(p)$ therefore contains all objects in D whose correlation coefficient with p is not less than Eps.
Other definitions remain unchanged, and the flow chart of CCDBSCAN clustering algorithm is shown in Figure 3.
DBSCAN is used to process the DGA data in an unsupervised manner: the correlation and dispersion between the sample vectors of the DGA data are analyzed, and the samples are classified according to their correlation, so that DGA samples of the same fault set are as similar as possible and samples of different fault sets are as different as possible. This method is called CCDBSCAN.
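The change from a distance-based to a correlation-based neighborhood can be sketched as follows. This is an illustrative Python sketch, not the MATLAB program used in this paper: the function names are hypothetical, the six samples are invented percentage vectors, and Eps = 0.95 and MinPts = 3 are example values only. A sample with identical values in every attribute would give zero variance and is not handled here.

```python
import math

NOISE = -1


def pearson(u, v):
    """Correlation coefficient r(p, q) of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def corr_region_query(samples, idx, eps):
    """Indices q with r(p, q) >= eps: the correlation-based neighborhood."""
    return [j for j, q in enumerate(samples)
            if j == idx or pearson(samples[idx], q) >= eps]


def ccdbscan(samples, eps, min_pts, factors=None):
    """DBSCAN with a correlation neighborhood; optional amplification."""
    if factors is not None:
        samples = [[x * f for x, f in zip(s, factors)] for s in samples]
    labels = [None] * len(samples)
    cluster = -1
    for i in range(len(samples)):
        if labels[i] is not None:
            continue
        neighbors = corr_region_query(samples, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = corr_region_query(samples, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)
    return labels


# Hypothetical percentages [H2, CH4, C2H6, C2H4, C2H2]:
# three hydrogen-dominated and three ethylene-dominated samples.
samples = [
    [80, 10, 5, 3, 2], [78, 12, 5, 3, 2], [82, 9, 4, 3, 2],
    [10, 20, 15, 50, 5], [12, 18, 15, 50, 5], [9, 21, 16, 49, 5],
]
labels = ccdbscan(samples, eps=0.95, min_pts=3)
```

Note that Eps is now a lower bound on correlation rather than an upper bound on distance, so raising Eps makes the neighborhood stricter.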

Example Analysis and Results
The DGA data used in this section come from the fault transformer records of Huzhou Power Supply Company of Zhejiang Electric Power Company of State Grid of China, the case data of the IEC TC 10 dataset [27], and the typical application cases of power grid equipment state detection technology [33]. The data cover six kinds of faults: T1, T2, T3, PD, D1 and D2. More than 2000 pieces of data were collected. Because these data sources are numerous and scattered, the data must be filtered and sorted. First, the data are initialized by calculating the percentage of hydrogen and the various hydrocarbon gases in the gas; then the data are grouped by fault category, the mean value and standard deviation of each fault category are calculated, and data lying more than twice the standard deviation from the mean value are eliminated. After several rounds of screening, 60 pieces of DGA data are left for the clustering test in this paper, 10 for each fault type, and each piece contains five attributes: H2, CH4, C2H6, C2H4 and C2H2.
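One screening pass of this kind can be sketched in Python as follows. The text does not say whether the rule is applied per attribute or whether the population or sample standard deviation is used, so the sketch assumes a per-attribute test with the population standard deviation; the function name and the values are hypothetical.

```python
import statistics


def filter_two_sigma(samples):
    """Keep samples whose every attribute lies within mean +/- 2 std."""
    columns = list(zip(*samples))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [s for s in samples
            if all(abs(x - m) <= 2 * sd
                   for x, m, sd in zip(s, means, stds))]


# Hypothetical one-attribute samples of a single fault class; 50 is an outlier.
raw = [[10], [11], [9], [10], [10], [11], [9], [10], [10], [50]]
kept = filter_two_sigma(raw)
```

Repeating the pass on `kept` corresponds to the several rounds of screening mentioned above, since the mean and standard deviation change once outliers are removed.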
In this section, CCDBSCAN is implemented in MATLAB 2018b. All programs were run on a computer with a Core i7-4710MQ CPU, 8 GB of memory and a 1 TB hard disk.

CCDBSCAN Algorithm Steps
The fault diagnosis method of CCDBSCAN is divided into the following steps:
(1) Data initialization: calculate the percentage content of each gas from the collected DGA gas content data. The equation used for data initialization is
$$p_i = \frac{x_i}{\sum_{j=1}^{5} x_j} \times 100\%, \quad i = 1, \ldots, 5$$
where $p_i$ is the percentage content of the i-th gas, $x_1$ is the hydrogen content, $x_2$ is the methane content, $x_3$ is the ethane content, $x_4$ is the ethylene content and $x_5$ is the acetylene content, all in µL/L.
(2) Use the chaotic sequence optimization to obtain the amplification coefficient.
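The data initialization of step (1) amounts to dividing each gas content by the total of the five gases. A minimal Python sketch, with a hypothetical function name and made-up concentrations, is:

```python
def to_percentages(sample):
    """Convert absolute gas contents (µL/L) into percentages of the total."""
    total = sum(sample)
    if total == 0:
        raise ValueError("sample contains no gas; percentages are undefined")
    return [100.0 * x / total for x in sample]


# Hypothetical sample in the order [H2, CH4, C2H6, C2H4, C2H2], in µL/L.
percent = to_percentages([50, 20, 10, 15, 5])
```

Working with percentages removes the influence of the overall gas level, so that samples are compared by the shape of their gas profile.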

Analysis of Chaos Sequence Optimization Results
The parameters of the chaos sequence optimization are set as follows: the number of orbits is 1500, the dimension of the variables is 5, the maximum number of iterations is 250, and the minimum error is 10^-3. The fitness function is the objective function of Equation (11). It can be seen in Figure 4 that the fitness of the chaotic sequence optimization decreases rapidly and stays at a local minimum point only for a short time. Owing to the chaotic behavior of the population, convergence is fast, and the optimal fitness found is 11.31.
The correlation analysis of the DGA data after multiplying the 7th to 10th data by the amplification coefficient is shown in Figure 5, and the correlation coefficient matrix of the 7th to 10th DGA data is recalculated in the same way.
The effect of the amplification coefficient is analyzed through the aggregation coefficient of each fault type. The aggregation coefficients of the original data are shown in Table 1. It can be seen that the aggregation coefficients of the six kinds of faults are generally large, while those of PD, D1 and D2 are relatively small, which increases the difficulty of density clustering.
The aggregation coefficients of the six kinds of fault data processed by the amplification coefficient are shown in Table 2. Applying the optimized amplification coefficient to all fault datasets and comparing the resulting aggregation coefficients with those of the original data shows that the aggregation coefficients of T1 and T2 decrease by 1.0% and 1.2%, respectively. The aggregation coefficients of T3, PD, D1 and D2 increase, with PD, D1 and D2 increasing significantly, by 22.0%, 9.8% and 13.7%, respectively. The results show that the correlation coefficient within each fault category is significantly improved after the gas content of each component is optimized, which is more conducive to cluster analysis.

CCDBSCAN Method Classification Result Analysis
The processed DGA data were used in CCDBSCAN analysis, and compared with the original DBSCAN method. The comparative analysis indexes were accuracy, precision, and recall, and were characterized by confusion matrix.
(1) DBSCAN analysis: after the 60 groups of DGA data of fault transformers are normalized, the DBSCAN method is used directly for cluster analysis. The cluster graph and confusion matrix are shown in Figure 6 and Table 3.
(2) CCDBSCAN analysis: after the 60 groups of DGA data of fault transformers are normalized, the CCDBSCAN method is used for cluster analysis. The cluster graph and confusion matrix are shown in Figure 7 and Table 4.

By comparing the confusion matrices of the two methods, we can see that the accuracy of the CCDBSCAN method is 90%, which is 31% higher than that of the original DBSCAN method. The accuracy of the original DBSCAN method in clustering PD, D1 and D2 is very low, at only 33%, which is caused by the low characteristic gas content of these three faults. The CCDBSCAN accuracy is computed as
$$\text{Accuracy} = \frac{10 + 9 + 10 + 9 + 8 + 8}{60} = 90\%$$
Table 5. DGA data, fault type and IEC60599 diagnosis results. Columns: Number, H2 (µL/L), CH4 (µL/L), C2H6 (µL/L), C2H4 (µL/L), C2H2 (µL/L), IEC60599.
A comparison of Figures 6 and 7 shows that the most significant difference between them is the clustering result in area 1. In Figure 6, area 1 is clustered into a single category, and five data are not successfully classified. In Figure 7, it is successfully clustered into two categories. In fact, the data in area 1 comprise two kinds of fault data, D1 and D2, with 9 data in D1 and 10 data in D2. Because the characteristic gas content of D1 and D2 faults is not prominent, their Euclidean distances differ little, so the original DBSCAN method cannot distinguish them effectively. The proposed method amplifies the characteristic gases of D1 and D2 faults so that they differ clearly, separates faults whose similarity previously differed only slightly, and thereby greatly improves the clustering accuracy.
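The effect described above can be illustrated numerically. In the following sketch, two made-up percentage profiles that differ only in their low-content gases are compared before and after amplification. The profiles, the amplification weights and the way the weights are applied (element-wise on the percentage vector) are all assumptions chosen for illustration, not the optimized amplification coefficients of the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def amplify(x, weights):
    """Scale each gas by its (hypothetical) amplification weight."""
    return [a * w for a, w in zip(x, weights)]


# Made-up percentage profiles (H2, CH4, C2H6, C2H4, C2H2); both are
# dominated by H2 and differ only in the two low-content gases.
d1_like = [60, 10, 5, 10, 15]
d2_like = [60, 10, 5, 15, 10]
weights = [1, 1, 1, 5, 5]  # assumed: amplify C2H4 and C2H2 only

r_raw = pearson(d1_like, d2_like)
r_amp = pearson(amplify(d1_like, weights), amplify(d2_like, weights))
print(f"Euclidean distance (raw): {euclidean(d1_like, d2_like):.2f}")
print(f"correlation raw: {r_raw:.3f}, amplified: {r_amp:.3f}")
```

With these made-up numbers the correlation between the two profiles falls from about 0.988 to about 0.838, so a correlation-based density threshold that merges the raw profiles can separate the amplified ones.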

Analysis of Fault Diagnosis Results
After the DGA data of different fault types are collected, the CCDBSCAN method is used to extract the relevant fault vector feature set, and the fault state diagnosis of transformers is realized by the feature set.
The book Typical Application Cases of Power Grid Equipment State Detection Technology, prepared by the operation and maintenance department of State Grid Corporation of China [33], contains data collected and sorted under the actual operating conditions of transformers and illustrated with field drawings, so its cases are convincing material for diagnosis and analysis. In this paper, 30 DGA data covering six fault modes are obtained by selecting fault transformers whose fault types were clearly established through field inspection. See Table 5 for the DGA data and fault types. The diagnosis performance of the fault mode clustering proposed in this paper is compared with that of the IEC 60599 method.
The above 30 data to be diagnosed are initialized with Equation (15): the absolute contents of the five gases are converted to percentage contents, the correlation coefficients between each datum and the six fault sets that have been clustered are calculated, and CCDBSCAN is used for clustering. T3, PD, D1 and D2 faults (corresponding to the 4th, 9th, 18th and 25th data in the table) are selected to illustrate the clustering process and the improvement achieved by CCDBSCAN. See Figure 8 for the line charts of the correlation coefficients.
The data in Figure 8a are T3 data, and their correlation coefficients with the six clusters are calculated. For the original data, the correlation coefficients with cluster 2 and cluster 3 are both high, and the maximum correlation coefficient with cluster 2 reaches 0.956, which interferes with the density clustering. After the characteristic gases of the T3 data are amplified, the correlation coefficients with the six clusters are recalculated, and the maximum correlation coefficient with cluster 2 is reduced to 0.901, which improves the accuracy of the density clustering algorithm. Figure 8b shows the same characteristics as Figure 8a, and the effect is especially clear in Figure 8c,d.
Between the low-energy (Figure 8c) and high-energy (Figure 8d) discharge faults, the difference between the correlation coefficient lines of the original data is not obvious. When the original data are used for clustering, the low-energy discharge DGA data of Figure 8c are mistakenly clustered into the high-energy discharge fault, resulting in misdiagnosis. After the gas data are amplified by the proposed method, however, the correlation between the data and the high-energy discharge cluster is clearly reduced: four correlation coefficients between the data and the high-energy discharge cluster that were above 0.95 are now reduced to 0.9. The correlation coefficient with the low-energy discharge cluster now differs from that with the high-energy discharge cluster, and the data are successfully diagnosed as a low-energy discharge fault. See Table 6 (columns: cluster1, cluster2, cluster3, cluster4, cluster5, cluster6, type) for the CCDBSCAN results and the average correlation coefficients of all 30 data.
It can be seen from Table 6 that, among the 30 fault data clustered by CCDBSCAN, the diagnosis results of 26 data are completely consistent with the actual fault types. Of the other four data, two are misjudged: No. 5 is actually D2 but is diagnosed as T3, and No. 17 is actually T3 but is diagnosed as PD. The diagnosis results of the remaining two, No. 7 and No. 8, are close to the actual results.
Although thermal faults are artificially divided into three types, T1, T2 and T3, with 300 °C and 700 °C as nominal boundaries, there is in fact no such clear physical boundary. The temperature distinction between T1, T2 and T3 is a qualitative, fuzzy description, and there are transition states between them. Similarly, D1 and D2 are distinguished by the level of discharge energy; there is also a transition state between them and no clear physical boundary. Therefore, using a correlation coefficient to describe the degree of similarity of DGA data to each fault, rather than giving an absolute diagnosis result, may be more practical in engineering. The diagnosis results of the above two data can still be regarded as valid.
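Reporting a similarity degree for every fault, rather than a single label, can be sketched as follows. The sketch assumes that a sample is converted to percentage contents and then scored by its average Pearson correlation with the members of each fault cluster, in line with the description above. The cluster members, the sample values and the function names `to_percent` and `diagnose` are hypothetical, and the amplification step is omitted for brevity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0


def to_percent(gases):
    """Convert absolute gas contents to percentage contents."""
    total = sum(gases)
    return [100 * g / total for g in gases]


def diagnose(sample, clusters):
    """Return (best cluster name, average correlation per cluster)."""
    x = to_percent(sample)
    scores = {
        name: sum(pearson(x, m) for m in members) / len(members)
        for name, members in clusters.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up cluster members, already expressed as percentage profiles
# (H2, CH4, C2H6, C2H4, C2H2). Real clusters would come from CCDBSCAN.
clusters = {
    "cluster1": [[60, 10, 5, 10, 15], [58, 12, 5, 10, 15]],
    "cluster2": [[5, 50, 30, 10, 5], [6, 48, 31, 10, 5]],
}
sample = [120, 20, 10, 20, 30]  # made-up absolute contents

best, scores = diagnose(sample, clusters)
print("closest cluster:", best)
for name, score in scores.items():
    print(f"{name}: average correlation = {score:.3f}")
```

Returning the full `scores` dictionary alongside the best match keeps the transition cases visible: a sample lying between two fault types shows two similar scores instead of one hard decision.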
Overall, Table 6 shows a high accuracy rate of fault diagnosis. According to the diagnostic rule [35] of IEC 60599-2015 (see Table 5 for the diagnostic results), 10 of the above 30 DGA data were misdiagnosed, namely Nos. 2, 5, 7, 8, 9, 10, 14, 19, 23 and 27, so the accuracy rate of fault diagnosis was relatively low.
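The counts quoted above can be turned into rates by simple arithmetic. The percentages below are not stated in the text; they are derived from the reported counts (10 of 30 misdiagnosed by the IEC rule, 26 of 30 exact matches for CCDBSCAN, and 28 of 30 if the two near matches, Nos. 7 and 8, are counted as valid) and should be read as a worked check rather than as reported results.

```latex
\begin{aligned}
\text{IEC 60599-2015:} \quad & \frac{30 - 10}{30} = \frac{20}{30} \approx 66.7\% \\
\text{CCDBSCAN (exact matches only):} \quad & \frac{26}{30} \approx 86.7\% \\
\text{CCDBSCAN (Nos. 7 and 8 counted as valid):} \quad & \frac{28}{30} \approx 93.3\%
\end{aligned}
```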

Conclusions
DBSCAN is an important clustering algorithm, but in transformer fault diagnosis the characteristic gas contents used to distinguish the faults differ in magnitude and their Euclidean distances are poorly separated, so applying the DBSCAN method directly to fault diagnosis gives unsatisfactory results. In this paper, according to the correlation characteristics of the same kind of transformer fault data, an aggregation coefficient is designed to represent the degree of similarity, and the CCDBSCAN cluster diagnosis is completed by optimizing and amplifying the fault characteristics. The following conclusions are obtained:
(1) The method proposed in this paper differs from traditional dissolved gas analysis in oil: the concept of the correlation coefficient is introduced into cluster analysis, and an aggregation coefficient is constructed to represent the degree of similarity of the data. Through the optimized amplification coefficient, gases that are important but low in content are amplified, which makes the correlation coefficient of dissolved gas data from the same fault higher than before.
(2) By introducing the correlation coefficient into the DBSCAN method, the accuracy of clustering is improved by 31%, which addresses the low clustering accuracy of the DBSCAN method. When used in fault diagnosis, the similarity between the test set and each fault can be represented by the correlation coefficient instead of a single diagnosis result, which is more in line with engineering practice.
(3) Using the correlation coefficient to represent the degree of similarity of the data and the CCDBSCAN method for clustering, the accuracy of fault diagnosis is significantly improved compared with the IEC 60599-2015 method, which gives the approach good application prospects.