Abnormality Detection of Cast-Resin Transformers Using the Fuzzy Logic Clustering Decision Tree

Abstract: Failures of cast-resin transformers not only reduce the reliability of power systems, but also have great effects on power quality. Partial discharges (PD) occurring in epoxy resin insulators of high-voltage electrical equipment will result in harmful effects on insulation and can cause power system blackouts. Pattern recognition of PD is a useful tool for improving the reliability of high-voltage electrical equipment. In this work, a fuzzy logic clustering decision tree (FLCDT) is proposed to diagnose the PD concerning the abnormal defects of cast-resin transformers. The FLCDT integrates a hierarchical clustering scheme with the decision tree. The hierarchical clustering scheme uses splitting attributes to divide the data set into suspended clusters according to separation matrices. The hierarchical clustering scheme is regarded as a preprocessing stage for classification using a decision tree. The whole data set is divided by the hierarchical clustering scheme into some suspended clusters, and the patterns in each suspended cluster are classified by the decision tree. The FLCDT was successfully adopted to classify the aberrant PD of cast-resin transformers. Classification results of FLCDT were compared with two software packages, See5 and CART. The FLCDT performed much better than CART and See5 in terms of classification precision.


Introduction
The power transformer is an important piece of equipment in a power system and directly affects the safety of the power station and the safe operation of the power grid. Among power transformers, the cast-resin transformer offers numerous advantages, such as low no-load loss, oil-free construction, flame retardance, maintenance-free operation, and good moisture and crazing resistance. It is well suited to sites where fire and explosion hazards are a concern, such as commercial centers, high-tech factories, hospitals, underground facilities, airports, train stations, tower buildings, and industrial and mining enterprises. Disturbances of power quality have significant financial consequences for network operators and customers. Since many uncertainties are involved, it is difficult to obtain the exact financial losses due to poor power quality. Therefore, online monitoring of cast-resin transformers has been an important challenge for power engineers. Failures of cast-resin transformers not only reduce the reliability of the power system, but also have great effects on power quality. Power engineers are devoted to strengthening the diagnosis of cast-resin transformers in order to discover hidden faults in time and guarantee normal operation. Partial discharge (PD) is one of the main causes of internal insulation deterioration in cast-resin transformers. Online monitoring of PD can reduce the risk of insulation failure of cast-resin transformers [1]. Many methods, such as ultrasound, acoustic emission, electrical contact, optical and radio frequency sensing, can be used to detect and locate PD in a cast-resin transformer [2]. For electrical detection, the UHF antenna is widely used in PD measurements because it is more sensitive than other methods where noise is a concern.
PD is a localized electrical discharge that occurs repetitively in a small region. In general, PD can be categorized into six forms according to their causes: corona discharge, surface discharge, internal discharge, electrical tree, floating partial discharge and contact noise. Corona discharge takes place at atmospheric pressure in the presence of inhomogeneous fields. Surface discharge appears in arrangements with tangential field distribution along the boundary of two different insulation materials. Internal discharge occurs within cavities or voids inside solid or liquid dielectrics. Electrical trees occur at points where gas voids, impurities, mechanical defects or conducting projections cause excessive local electrical field stresses within small regions of the dielectric. Floating PD occurs when there is an ungrounded conductor within the electric field between conductor and ground. Contact noise occurs if the ground connection to a bushing is poor.
PD occurs in high-voltage electrical equipment, such as cables, transformers, motors and generators. It is a very small spark caused by a high electrical field. Since a PD occurring in high-voltage electrical equipment has a specific pattern, pattern recognition of PD is a useful tool for improving the reliability of high-voltage electrical equipment [3]. PD diagnosis is likewise a useful tool for evaluating cast-resin transformers and preventing possible failures. It is essential to distinguish the different types of faults by PD diagnosis in order to estimate the likely defect type and severity. PD pattern recognition can identify potential faults and reveal insulation defects from the measured data. These findings are then used to estimate the risk of insulation failure in high-voltage electrical equipment. This information is important for evaluating the risk of discharge in the insulation. In the past, PD pattern recognition depended on expert judgment for classification and defect level determination. Such a process is unscientific and requires professional experience gained over years of practice.
To date, artificial intelligence techniques have been adopted for pattern recognition and classification of PD. Mor et al. used the cross wavelet transform to perform automatic PD recognition [4]. Wavelet analysis has been regarded as a promising tool for denoising and fault diagnosis; however, it is difficult to determine the decomposition level that yields the best result. Gu et al. proposed a fractional Fourier transform-based approach for gas-insulated switchgear PD recognition [5]. Ma et al. proposed a fractal theory-based PD recognition technique for medium-voltage motors [6]. However, some clusters of PD patterns are very close in the fractal map, which may result in incorrect identification.
As a more scientific approach, machine learning is used for PD recognition to avoid human error [7].
Numerous machine learning techniques exist for the pattern recognition of PD, such as the artificial neural network [8], clustering [9,10], the support vector machine [11] and deep learning [12][13][14]. The artificial neural network is an information processing model which acquires empirical knowledge through a learning process. However, it is computationally expensive and lacks rules for determining the proper network structure. The clustering technique is based on the stream density and clustering theory; however, the zero-weight problem exists in the general clustering approach. Support vector machines (SVM) are supervised learning techniques based on statistical learning theory which may be applied to PD pattern recognition; however, the classification performance of SVM is easily affected by the parameter settings. Deep learning has been successfully applied in pattern recognition and image segmentation; however, applying it to PD recognition is challenging because of the limited data availability.
The contribution of this work is to develop a fuzzy logic clustering decision tree (FLCDT) to classify the abnormal defects of cast-resin transformers. Fuzzy logic methods have been successfully applied to many applications in renewable energy. Liu et al. developed an ultra-short-time forecasting method based on the Takagi-Sugeno fuzzy model for wind power and wind speed [15]. In [16], an offline time series forecasting approach with an adaptive neuro-fuzzy inference system was conducted for electrical insulator fault forecast. Wang et al. proposed a fuzzy hybrid model to evaluate the energy policies and investments in renewable energy resources [17]. Thao et al. presented an improved interval fuzzy modeling technique to estimate solar photovoltaic, wind and battery power in a demonstrative renewable energy system under large data changes [18].
A 60-MVA cast-resin transformer with a rated voltage of 22.8 kV is used in this study. The IEC 60270 standard [19] is used to perform an off-line PD measurement on the electrical equipment. The training dataset has three continuous attributes and three abnormal defects. The three continuous attributes are the number of discharges (n) over the chosen block, the discharge magnitude (q) and the corresponding phase angle (φ) where PD pulses occur. The three abnormal defects are failure in S-phase cable termination, failure in R-phase cable and failure in T-phase cable termination. The FLCDT integrates a hierarchical clustering scheme with the decision tree. The hierarchical clustering scheme uses splitting attributes to divide the data set into suspended clusters according to a separation matrix and fuzzy rules. The suspended clusters consist of more than one pattern, which can be further classified by the decision tree [20].
In the remainder of this paper, Section 2 presents the fuzzy logic clustering decision tree. Section 3 introduces the PD measurements of cast-resin transformers and describes the pattern recognition of PD. In Section 4, the FLCDT is applied to classify the aberrant PD of cast-resin transformers and is compared with two software packages, See5 and CART. Finally, Section 5 concludes the paper.

Motivation
Since the number of possible attributes and the number of classes are often rather large, data mining techniques have been receiving increasing attention from the research community. For example, fault detection in ion implantation processes is a challenging issue in semiconductor fabrication because of the large number of wafer recipes. Fuzzy-rule-based classification algorithms [21,22] have received significant attention among researchers due to their finer fuzzy partition and good behavior on real-time databases. These advantages may be lost if the number of attributes and the number of classes become large, because a finer partition of fuzzy subsets is then required, which results in large fuzzy-rule sets. To resolve this disadvantage, the main characteristic of the developed method is to divide the classes into specific clusters to accomplish a finer partition of fuzzy subsets. Figure 1 illustrates an eight-class example of cluster splitting, in which the classes are divided into four suspended clusters. In each cluster, the recognizability is now four times larger than in the original structure. Thus, the approach can not only achieve higher classification accuracy, but also reduce computational complexity. Since each cluster can be further classified by data mining techniques, the clustering concept of the proposed method is hierarchical. The hierarchical concept has been adopted fairly widely in various classification methods, including hierarchical decision trees [23,24], hierarchical Bayesian networks [25,26] and hierarchical neural networks [27,28], to improve the computation time and accuracy of classification.
Accordingly, the FLCDT scheme is proposed to achieve a finer fuzzy partition without expensive computation. The idea behind the FLCDT is to measure the distance between two classes with respect to an attribute. A separability factor is used to decide whether the two classes belong to the same cluster or not. After performing the FLCDT, a cluster spanning tree containing a cluster leader and some suspended clusters is constructed. The cluster leader is the root of the cluster spanning tree. The number of classes in any suspended cluster is much smaller than in the cluster leader. The flow diagram of the FLCDT scheme is displayed in Figure 2.
Figure 2. Flow diagram of the FLCDT scheme: input the training data set; determine the separation matrix based on the Chebyshev inequality; determine the priority of crucial attributes using Algorithm 1; generate the fuzzy rules of each separable cluster using Algorithm 2; construct the cluster spanning tree using Algorithm 3; construct the decision tree of each suspended cluster using the C4.5 algorithm.

Separation Matrix Based on the Chebyshev Inequality
Since not all attributes are indispensable for separating classes, a specified criterion can be used to select a few critical and effective ones to split clusters. The attribute values of the members in the given training data spread over a specific range with a particular probability density function. Thus, the degree of overlap of the attribute values is used to decide the separability between two classes. For instance, Figure 3 shows two classes $C_i$ and $C_j$ that are separable for the $k$th attribute, while Figure 4 shows two classes that are not separable.

The separability factor $S(C_i, C_j)_k$ is used to determine whether two classes $C_i$ and $C_j$ are separable for the $k$th attribute, and is defined as

$$S(C_i, C_j)_k = \begin{cases} 0, & \text{if } C_i \text{ and } C_j \text{ are separable for the } k\text{th attribute}, \\ 1, & \text{otherwise}. \end{cases}$$

The value of $S(C_i, C_j)_k$ is calculated by the Chebyshev inequality [29]. Let $X_i^k$ denote the random variable for the $k$th attribute of class $C_i$. We assume without loss of generality that $\mu_i^k < \mu_j^k$, where $\mu_i^k$ and $\sigma_i^k$ represent the mean and standard deviation of $X_i^k$, respectively. Let $a_i^k$ be a positive real value such that

$$P\left[X_i^k \ge \mu_i^k + a_i^k\right] \le \alpha,$$

where $P[\cdot]$ represents the probability of $(\cdot)$, and $\alpha$ denotes the significance level, which is set to be 0.05. Based on the Chebyshev inequality [29], $P\left[\,|X_i^k - \mu_i^k| \ge a_i^k\,\right] \le (\sigma_i^k)^2 / (a_i^k)^2$, the value of $a_i^k$ is set as

$$a_i^k = \frac{\sigma_i^k}{\sqrt{\alpha}} + \delta,$$

where $\delta$ is a tiny positive real value. If $\mu_j^k$ is sufficiently greater than $\mu_i^k + a_i^k$, the value of $p_i$ is very small and the two classes $C_i$ and $C_j$ are more easily separable, as illustrated in Figure 5. Thus, a threshold value $\bar{p}$ can be used to determine the separability factor for two classes $C_i$ and $C_j$.
Now, the separation matrix for the $k$th attribute is defined as $[S(C_i, C_j)_k]$.
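As an illustration of how a separation matrix might be computed from class statistics, the following Python sketch applies Chebyshev bounds to each pair of classes. The margin `sigma / sqrt(alpha) + delta`, the overlap bound used for `p`, the threshold `p_bar`, and the function names `chebyshev_margin`, `is_separable` and `separation_matrix` are all assumptions made for this sketch; they are not taken from the original method.

```python
import math

def chebyshev_margin(sigma, alpha=0.05, delta=1e-6):
    # Chebyshev: P(|X - mu| >= a) <= sigma**2 / a**2.
    # Assumed choice: a = sigma / sqrt(alpha) + delta keeps that bound below alpha.
    return sigma / math.sqrt(alpha) + delta

def is_separable(stat_i, stat_j, alpha=0.05, p_bar=0.05):
    # Each stat is (mean, std) of one attribute for one class.
    # Order the pair so that the class with the smaller mean comes first.
    (mu_lo, sd_lo), (mu_hi, sd_hi) = sorted([stat_i, stat_j])
    edge = mu_lo + chebyshev_margin(sd_lo, alpha)
    gap = mu_hi - edge
    if gap <= 0:
        return False
    # Assumed overlap measure: Chebyshev bound on the upper class
    # producing a value at or below `edge`.
    p = min(1.0, sd_hi ** 2 / gap ** 2)
    return p <= p_bar

def separation_matrix(stats, alpha=0.05, p_bar=0.05):
    # Entry 1 means "not separable" (same cluster), 0 means "separable".
    # The diagonal is left at 1: a class is never separable from itself.
    n = len(stats)
    S = [[1] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if is_separable(stats[i], stats[j], alpha, p_bar):
                S[i][j] = S[j][i] = 0
    return S

# Hypothetical (mean, std) values of one attribute for three classes.
stats = [(0.0, 1.0), (1.0, 1.0), (100.0, 1.0)]
S = separation_matrix(stats)
for row in S:
    print(row)
```

With these hypothetical statistics, the first two classes overlap and stay connected, while the third is far enough away to be separated from both.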

Divide Cluster
To select the classes that belong to the same cluster, a separability graph is constructed according to the separability matrix $[S(C_i, C_j)_k]$. Regarding each class as a node, $[S(C_i, C_j)_k]$ is treated as the adjacency matrix of the graph for the $k$th attribute. If $S(C_i, C_j)_k = 1$, the two nodes $C_i$ and $C_j$ are connected by an arc. The separability graph contains several disjoint connected subgraphs. Each connected subgraph indicates a cluster, and the number of disjoint connected subgraphs is the number of suspended clusters obtained by the $k$th attribute. For example, Figure 6 shows a separability graph, which is constructed according to the separability matrix shown in Figure 7. The separability graph has two clusters: the first comprises classes 1, 2, 3 and 4, and the other comprises classes 5, 6, 7 and 8.
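Reading clusters off a separability graph amounts to finding its connected components. The Python sketch below does this with a depth-first search. The 8-class matrix is a hypothetical one built to give the same two clusters as the example in the text (it is not the matrix of Figure 7), classes are indexed from 0, and the names `clusters_from_matrix` and `block_matrix` are illustrative.

```python
def clusters_from_matrix(S):
    """Return the connected components of the graph whose adjacency matrix is S."""
    n = len(S)
    unvisited = set(range(n))
    clusters = []
    for start in range(n):
        if start not in unvisited:
            continue
        unvisited.discard(start)
        stack, group = [start], []
        while stack:
            i = stack.pop()
            group.append(i)
            # Follow every arc (entry equal to 1) to a class not yet visited.
            for j in range(n):
                if S[i][j] == 1 and j in unvisited:
                    unvisited.discard(j)
                    stack.append(j)
        clusters.append(sorted(group))
    return clusters

def block_matrix(groups, n):
    """Build a hypothetical matrix in which classes of the same group are connected."""
    S = [[0] * n for _ in range(n)]
    for g in groups:
        for i in g:
            for j in g:
                S[i][j] = 1
    return S

# Classes 1-4 and 5-8 of the text correspond to indices 0-3 and 4-7 here.
S = block_matrix([[0, 1, 2, 3], [4, 5, 6, 7]], 8)
clusters = clusters_from_matrix(S)
print(clusters)
```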
Figure 6. Separability graph.

Selection of Crucial Attributes
It is possible that no classes can be separated using a given attribute, in which case the separability graph for this attribute is a single connected graph. Thus, an attribute which can divide all classes into at least two clusters is defined as a crucial attribute (CA). Since there are several CAs in the training data, a disjoint cluster obtained using one CA can be further divided using another CA. This is why the proposed cluster splitting is called hierarchical cluster splitting. Because the priority of the CAs used to split the classes influences the classification accuracy, the procedure of hierarchical cluster splitting is described below. First, the set of all classes is defined as the cluster leader $Cr_0$. After successively applying two CAs, say $CA_1$ and $CA_2$, to $Cr_0$, the connectivity results from the conjunction operation of $[S(C_i, C_j)_{k_1}]$ and $[S(C_i, C_j)_{k_2}]$, where $k_1$ and $k_2$ represent the attributes selected as $CA_1$ and $CA_2$, respectively. The conjunction operation of two matrices is defined as the element-wise logical AND of their corresponding entries:

$$[S(C_i, C_j)_{k_1}] \wedge [S(C_i, C_j)_{k_2}] = [S(C_i, C_j)_{k_1} \wedge S(C_i, C_j)_{k_2}].$$

Figure 8 displays a typical cluster spanning tree of $m$ CAs, where $CA_i$ represents the CA used in the $i$th level, $L$ is the number of clusters in the first level, and $n_L$ denotes the number of clusters in the second level of the cluster $C_L$. A suspended cluster (SC) is a cluster that is obtained from the last CA in the CA priority sequence or that contains only one class.
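The effect of applying two CAs in succession can be seen on a small case. The sketch below assumes the conjunction is an element-wise logical AND of two 0/1 matrices; the two 3-class matrices and the name `conjunction` are hypothetical.

```python
def conjunction(S1, S2):
    """Element-wise logical AND of two 0/1 matrices of the same shape."""
    return [[a & b for a, b in zip(row1, row2)] for row1, row2 in zip(S1, S2)]

# Hypothetical separation matrices of three classes for attributes k1 and k2.
S_k1 = [[1, 1, 1],
        [1, 1, 0],
        [1, 0, 1]]
S_k2 = [[1, 0, 1],
        [0, 1, 0],
        [1, 0, 1]]

S_both = conjunction(S_k1, S_k2)
for row in S_both:
    print(row)
```

Two classes stay connected only if neither attribute separates them. Here classes 0 and 2 remain in one cluster, while class 1 is split off because attribute k2 separates it from class 0 and both attributes separate it from class 2.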
For any priority sequence of CAs, the number of SCs in the cluster spanning tree is the same. However, an improper split of earlier clusters will affect the accuracy of later cluster splitting along the path of the cluster spanning tree. For example, Figures 9 and 10 show the separability matrices of a classification problem with 8 classes, $C_1$ to $C_8$, and two attributes, $k_1$ and $k_2$. If the $k_1$ attribute is used first to split the 8 classes in Figure 9, there are three clusters after splitting: one comprises 1 class, and the other two comprise 3 and 4 classes. If the $k_2$ attribute is used first to split the 8 classes in Figure 10, there are four clusters after splitting and each cluster contains 2 classes. The $k_2$ attribute is chosen to split the cluster leader because it results in more SCs.
To describe the criterion, we define $L_k$ and $n_{k,l}(Cr_j)$ as the number of SCs and the number of classes in the $l$th SC obtained when the $k$th attribute is used to divide the cluster $Cr_j$, respectively. The criterion for selecting the attribute $k$ to divide $Cr_j$ is to choose the attribute with the smallest value of $\upsilon_k(Cr_j)$ in Equation (3), which measures the variation of the number of classes in the obtained SCs around their average $\bar{n}_k(Cr_j) = \sum_{l=1}^{L_k} n_{k,l}(Cr_j)/L_k$. Obviously, an attribute will result in more SCs if it has a smaller variation in the number of classes in the SCs. This attribute is the CA that we seek. Consider the separability matrices shown in Figures 9 and 10. If the $k_1$ attribute is used to divide the cluster first, there are three SCs: one comprises one class and the other two comprise three and four classes. The value of $\upsilon_{k_1}(Cr_0)$ is 1.11. If the $k_2$ attribute is used first, there are four SCs and each SC comprises two classes. The value of $\upsilon_{k_2}(Cr_0)$ is 0.25. Since the value of $\upsilon_{k_2}(Cr_0)$ is the smallest, the $k_2$ attribute is chosen to split $Cr_0$. Now, the algorithm (Algorithm 1) to determine the priority of CAs for constructing the cluster spanning tree is described below.
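To make the two reported values concrete, the sketch below uses one candidate score that happens to reproduce both of them: one plus the sample variance of the SC sizes, divided by the number of SCs. This form is an assumption chosen only because it matches 1.11 and 0.25; other expressions could match the same two numbers, and it should not be read as the definition in Equation (3). The name `split_score` is illustrative.

```python
import math

def split_score(sizes):
    """Assumed score: (1 + sample variance of SC sizes) / number of SCs."""
    L = len(sizes)
    if L < 2:
        # An attribute that yields a single cluster cannot split anything.
        return math.inf
    mean = sum(sizes) / L
    sample_var = sum((n - mean) ** 2 for n in sizes) / (L - 1)
    return (1.0 + sample_var) / L

score_k1 = split_score([1, 3, 4])      # three SCs with 1, 3 and 4 classes
score_k2 = split_score([2, 2, 2, 2])   # four SCs with 2 classes each
print(round(score_k1, 2), round(score_k2, 2))
```

Under this assumed score, more clusters and more even cluster sizes both lower the value, which agrees with the preference for the $k_2$ attribute in the example.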

Algorithm 1: Determine the priority of CAs
Step 1: Use the training data set to calculate the separation matrix for each attribute. Set the set of non-split clusters (NSC) to $\{Cr_0\}$.
Step 2: Determine the splitting attribute $k$ according to Equation (3) for each cluster in NSC, then use this attribute to divide the cluster and move the resulting clusters into NSC.
Step 3: Remove from NSC the clusters that have been divided and those that cannot be divided by any attribute.
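The steps of Algorithm 1 can be sketched as a work-list loop over non-split clusters. In the Python sketch below, the scoring function is an assumed placeholder (one plus the sample variance of the cluster sizes, divided by the number of clusters), the two 4-class separation matrices are hypothetical, and the names `components`, `score` and `build_priority` are illustrative rather than part of the original method.

```python
import math

def components(S, members):
    """Split `members` into groups connected by arcs S[i][j] == 1 inside the cluster."""
    remaining = list(members)
    groups = []
    while remaining:
        stack, group = [remaining.pop(0)], []
        while stack:
            i = stack.pop()
            group.append(i)
            linked = [j for j in remaining if S[i][j] == 1]
            for j in linked:
                remaining.remove(j)
            stack.extend(linked)
        groups.append(sorted(group))
    return groups

def score(groups):
    """Assumed splitting score; smaller is better, inf if nothing is split."""
    L = len(groups)
    if L < 2:
        return math.inf
    sizes = [len(g) for g in groups]
    mean = sum(sizes) / L
    return (1.0 + sum((s - mean) ** 2 for s in sizes) / (L - 1)) / L

def build_priority(matrices, n_classes):
    """matrices maps attribute name -> separation matrix (must be non-empty)."""
    nsc = [list(range(n_classes))]            # Step 1: NSC = {Cr_0}
    plan = []
    while nsc:
        cluster = nsc.pop(0)
        # Step 2: pick the attribute with the smallest score for this cluster.
        best = min(matrices, key=lambda k: score(components(matrices[k], cluster)))
        groups = components(matrices[best], cluster)
        if len(groups) < 2:
            continue                          # Step 3: no attribute can divide it
        plan.append((cluster, best, groups))
        nsc.extend(g for g in groups if len(g) > 1)
    return plan

# Hypothetical separation matrices for four classes and two attributes.
S_a = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
S_b = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 1], [0, 1, 1, 1]]
plan = build_priority({"a": S_a, "b": S_b}, 4)
for cluster, attribute, groups in plan:
    print(cluster, attribute, groups)
```

In this example attribute `a` splits the cluster leader into two clusters, and attribute `b` then splits the first of them into single classes; the second cluster cannot be divided further and becomes a suspended cluster.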

The Hierarchical Clustering Scheme
The hierarchical clustering scheme has two phases: the training phase for generating the fuzzy logic rules and the classifying phase to classify a new data pattern. In the training phase, a data set with predetermined SCs is given. The fuzzy logic rules are generated according to the given data patterns. In the classifying phase, a fuzzy inference mechanism is utilized to classify an unknown data pattern according to the fuzzy logic rules.

The Fuzzy Rules Generation
Consider a given training data set for a non-SC cluster $Cr_j$ in the cluster spanning tree, where an attribute $k_i$ can split the cluster into SCs. The $g$ given data patterns for attribute $k_i$ are denoted as $x^p_{k_i}$, $p = 1, \ldots, g$, with $M$ known SCs, $SCr_{j1}, \ldots, SCr_{jM}$. These $g$ data patterns are used to train the splitting of the non-SC cluster $Cr_j$. The fuzzy if-then rule [30,31] is defined as follows.

Rule $R_i$: If $x^p_{k_i}$ is $A^I_i$, then $x^p$ belongs to $SCr_{ji}$ with $CF^I_i$,

where $I$ denotes the number of fuzzy subsets, $A^I_i$ denotes the $i$th fuzzy subset, $i = 1, \ldots, I$, $SCr_{ji}$ represents the consequent, which is one of the $M$ SCs, and $CF^I_i$ denotes the certainty grade of rule $R_i$. Let $\mu_i(\cdot)$ represent the membership function of $(\cdot)$ with respect to the fuzzy subset $A^I_i$. Therefore, $\mu_i(x^p_{k_i})$ can be treated as the compatibility grade of $x^p_{k_i}$ corresponding to $A^I_i$. Define $\beta_{SCr_{jl}}(R_i) = \sum_{x^p \in SCr_{jl}} \mu_i(x^p_{k_i})$ as the sum of compatibility grades for $SCr_{jl}$ corresponding to $A^I_i$. The generation of fuzzy logic rules to split cluster $Cr_j$ is summarized step by step in Algorithm 2.

Algorithm 2: Generate the fuzzy rules
Step 1: Given the $g$ training data $x^p_{k_i}$, $p = 1, \ldots, g$, and the splitting attribute $k_i$ for cluster $Cr_j$ with $M$ known $SCr_{jm}$, $m = 1, \ldots, M$, set $i = 1$.
Step 2: Calculate the sum of compatibility grades $\beta_{SCr_{jm}}(R_i)$ for each $SCr_{jm}$, $m = 1, \ldots, M$.
Step 3: Determine the $SCr_{jx}$ with the maximum sum of compatibility grades, $\beta_{SCr_{jx}}(R_i) = \max\{\beta_{SCr_{j1}}(R_i), \ldots, \beta_{SCr_{jM}}(R_i)\}$.
Step 4: Calculate the certainty grade $CF^I_i$ of rule $R_i$, where $\bar{\beta}(R_i) = \sum_{SCr_{jm} \neq SCr_{jx}} \beta_{SCr_{jm}}(R_i)/(M - 1)$ denotes the mean of the sums of compatibility grades for the remaining SCs corresponding to $A^I_i$.
Step 5: If $i = I$, then stop; else, set $i = i + 1$ and go to Step 2.
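The rule generation loop can be sketched as follows. The shape of the membership functions and the certainty grade formula are not stated in this section, so the sketch assumes evenly spaced triangular fuzzy subsets on attribute values normalized to [0, 1], and the certainty grade commonly used with this kind of rule, $CF = (\beta_{x} - \bar{\beta}) / \sum_m \beta_m$. Both choices, and the function names, are illustrative assumptions.

```python
def triangular_memberships(num_subsets):
    """Return I evenly spaced triangular membership functions on [0, 1] (assumed shape)."""
    if num_subsets == 1:
        return [lambda x: 1.0]
    width = 1.0 / (num_subsets - 1)

    def make(center):
        return lambda x: max(0.0, 1.0 - abs(x - center) / width)

    return [make(i * width) for i in range(num_subsets)]

def generate_rules(values, labels, num_clusters, num_subsets):
    """Return one (consequent SC index, certainty grade) pair per fuzzy subset."""
    rules = []
    for mu in triangular_memberships(num_subsets):
        # Step 2: sum of compatibility grades for each suspended cluster.
        beta = [0.0] * num_clusters
        for x, m in zip(values, labels):
            beta[m] += mu(x)
        total = sum(beta)
        if total == 0.0:
            rules.append((None, 0.0))  # no training pattern is compatible
            continue
        # Step 3: the SC with the largest sum becomes the consequent.
        winner = max(range(num_clusters), key=beta.__getitem__)
        # Step 4: mean of the remaining sums, then the assumed CF formula.
        mean_rest = (total - beta[winner]) / (num_clusters - 1)
        rules.append((winner, (beta[winner] - mean_rest) / total))
    return rules

# Hypothetical normalized attribute values for two suspended clusters (0 and 1).
values = [0.05, 0.10, 0.20, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 1, 1, 1]
rules = generate_rules(values, labels, num_clusters=2, num_subsets=3)
print(rules)
```

In this toy data set the outer fuzzy subsets each cover a single cluster and receive a certainty grade of 1, while the middle subset is shared equally and receives a grade close to 0.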
The hierarchical clustering scheme (Algorithm 3) is summarized as follows.

Algorithm 3: Hierarchical clustering scheme
Step 1: Calculate $\mu^k_i$ and $\sigma^k_i$, and determine the separation matrix $[S(C_i, C_j)_k]$ for each attribute $k$.
Step 2: Use Algorithm 1 to determine the priority sequence of CAs for constructing the cluster spanning tree.
Step 3: Apply Algorithm 2 to create the fuzzy if-then rules.

The Classification Processes
After the fuzzy if-then rules have been created for each cluster, a new data pattern can be assigned to a suitable SC. Let $x_{k_j}$ represent the $k_j$ attribute value of a new data pattern at cluster $Cr_j$. The weighted certainty grade of $x_{k_j}$ corresponding to $SCr_{jm}$ is defined as $\alpha_{SCr_{jm}} = \sum_{R_i} \mu_i(x_{k_j}) \cdot CF^I_i$, which is the sum, over the fuzzy rules $R_i$ whose consequent is $SCr_{jm}$, of the product of the compatibility grade of $x_{k_j}$ corresponding to $A^I_i$ and the certainty grade of the rule. The classification process is stated below.
Classification process: The SC with the maximum weighted certainty grade of $x_{k_j}$ is the desired cluster $SCr_{jl}$, i.e., $SCr_{jl} = \arg\max\{\alpha_{SCr_{j1}}, \ldots, \alpha_{SCr_{jM}}\}$. The step-wise procedure is given in Algorithm 4.

Algorithm 4: Classification
Step 1: Set the Current Cluster (CCr) = $Cr_0$, given the new data pattern $x$.
Step 2: Determine the cluster $SCr_{jl}$ with the maximum weighted certainty grade of $x_{k_j}$; this is the desired cluster of $x$.
Step 3: Classify $x$ into $SCr_{jl}$. If $SCr_{jl}$ is not a SC, then set CCr = $SCr_{jl}$ and return to Step 2; else, stop.
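The descent through the cluster spanning tree can be sketched as below. The dictionary layout of the tree, the ramp-shaped membership functions, and the certainty grades are hypothetical values chosen only to make the example runnable; each rule is stored as (membership function, index of the consequent child, certainty grade).

```python
def weighted_grades(x_value, rules, num_children):
    """alpha[m] = sum of mu_i(x) * CF_i over the rules whose consequent is child m."""
    alpha = [0.0] * num_children
    for mu, child, cf in rules:
        alpha[child] += mu(x_value) * cf
    return alpha

def classify(pattern, node):
    """Descend from the root cluster until a suspended cluster (a leaf) is reached."""
    while "children" in node:
        x_value = pattern[node["attribute"]]
        alpha = weighted_grades(x_value, node["rules"], len(node["children"]))
        best = max(range(len(alpha)), key=alpha.__getitem__)
        node = node["children"][best]
    return node["name"]

# Hypothetical membership functions on attribute values normalized to [0, 1].
low = lambda x: max(0.0, 1.0 - x)
high = lambda x: max(0.0, min(1.0, x))

# Hypothetical two-level tree: the root splits on attribute 1, its second child on attribute 0.
tree = {
    "attribute": 1,
    "rules": [(low, 0, 0.9), (high, 1, 0.8)],
    "children": [
        {"name": "SC1"},
        {
            "attribute": 0,
            "rules": [(low, 0, 1.0), (high, 1, 1.0)],
            "children": [{"name": "SC2"}, {"name": "SC3"}],
        },
    ],
}
print(classify((0.2, 0.9), tree))
```

The loop recomputes the weighted certainty grades at every non-SC cluster it enters, which mirrors the repetition of the selection step in Algorithm 4.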

Classify the Suspended Cluster Using C4.5
The decision tree is one of the more popular algorithms for classification problems and provides a good visualization that helps in decision making. The entropy-based algorithms that build multi-way decision trees, such as ID3 and C4.5 [32], are the most commonly used classification models designed for structured data. The Gini-index-based crisp decision tree algorithms, such as CART [33], Quest [34] and SLIQ [35], apply a numerical splitting criterion to build binary decision trees. C4.5 utilizes a minimum number of significant rules and some minor rules for classification. C4.5 is unstable in the sense that small variations in the data can produce significant differences in the model [20]. In addition, the run-time complexity of C4.5 corresponds to the tree depth, which is related to the number of training examples. To overcome this drawback of C4.5, a hierarchical clustering scheme is utilized as a preprocessing stage for classification. The whole data set is divided by the hierarchical clustering scheme into SCs, and the patterns in each SC are classified using C4.5. Since the number of patterns in each SC is reduced, the run-time complexity of C4.5 is reduced as well.
C4.5 is also composed of a training phase and a classifying phase. The goal of the training phase is to construct a decision tree and determine the splitting condition in each node. The critical attribute with the largest gain ratio is chosen as the splitting attribute to make the decision. C4.5 prunes trees after creation in an attempt to discard branches that are not helpful and replaces them with leaf nodes.
The mathematical basis of C4.5 is described below. Let $K$ denote the number of attributes and $T = \{x^1, x^2, \ldots, x^g\}$ denote the given training data set, where $x^p = (x^p_1, x^p_2, \ldots, x^p_K)$, $p = 1, \ldots, g$, is a data pattern and $x^p_k$ denotes the $k$th attribute value of $x^p$. Let $N$ denote the number of classes and $C_i$ denote the $i$th class. The probability that a data pattern selected from $T$ belongs to $C_i$ is $p_i = |C_i|/|T|$, where $|C_i|$ denotes the number of data patterns in $C_i$. The information conveyed by a probability distribution $P = (p_1, p_2, \ldots, p_N)$ is called the entropy, which is defined as

$$I(T) = -\sum_{i=1}^{N} p_i \log_2 p_i.$$

The value of $I(T)$ measures the uncertainty associated with the probability distribution. The expected information requirement to partition $T$ into $n$ subsets is

$$I(a_k, T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} I(T_i),$$

where $T_1, T_2, \ldots, T_n$ denote the partition of $T$ using the $k$th attribute, $a_k$. The value of $G(a_k, T)$ represents the expected reduction in entropy due to sorting on $a_k$, which is defined as

$$G(a_k, T) = I(T) - I(a_k, T).$$

C4.5 chooses the splitting attribute based on the gain ratio $R(a_k, T)$, which is defined as

$$R(a_k, T) = \frac{G(a_k, T)}{SI(a_k, T)},$$

where $SI(a_k, T)$ is the split information, which can be obtained by

$$SI(a_k, T) = -\sum_{i=1}^{n} \frac{|T_i|}{|T|} \log_2 \frac{|T_i|}{|T|}.$$

The partition values of a continuous attribute $a_k$ are first arranged in ascending order, $a^1_k, a^2_k, \ldots, a^m_k$. For each partition value $a^j_k$, $j = 1, 2, \ldots, m$, the data patterns are partitioned into two sets. The first contains the values less than or equal to $a^j_k$, and the other contains the values greater than $a^j_k$. We compute $R(a^j_k, T)$ for each partition value $a^j_k$ and then select the partition value that maximizes the gain ratio.
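The gain ratio search over the partition values of a continuous attribute can be illustrated with a short sketch. The function names and the toy discharge magnitudes are hypothetical; the formulas are the standard entropy, information gain, split information and gain ratio used by C4.5.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """I(T) = -sum_i p_i log2 p_i over the class frequencies in labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split 'value <= threshold' versus 'value > threshold'."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a degenerate split carries no information
    total = len(labels)
    expected = sum(len(part) / total * entropy(part) for part in (left, right))
    gain = entropy(labels) - expected
    split_info = -sum(len(part) / total * log2(len(part) / total)
                      for part in (left, right))
    return gain / split_info

def best_threshold(values, labels):
    """Try every distinct value as a partition value and keep the best gain ratio."""
    return max(sorted(set(values)), key=lambda t: gain_ratio(values, labels, t))

# Hypothetical discharge magnitudes (pC) for two classes.
values = [12, 15, 18, 30, 33, 40]
labels = ["normal"] * 3 + ["fault"] * 3
threshold = best_threshold(values, labels)
print(threshold, gain_ratio(values, labels, threshold))
```

For this toy data the split at 18 pC separates the two classes perfectly, so both the information gain and the split information equal 1 bit.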

Matrix Transformation of 3D PD Patterns
Figure 11 shows a typical 2D PD pattern, where the horizontal axis represents the discharge phase angle ranging from 0° to 360°, the vertical axis represents the discharge magnitude ranging from 0 pC to 60 pC, and each point is a discharge signal.

Figure 11. Typical 2D partial discharge (PD) pattern.

Figure 12 shows a typical 3D PD pattern. The key attributes of typical 3D PD patterns include phase angle (φ), discharge magnitude (q) and number of discharges (n). In the data sets, the format of different categories may not be the same as expected. To meet the data format required by FLCDT, data transformation of the 3D PD pattern is necessary. Figure 13 shows the three steps of data transformation for a 3D PD pattern. In step 1, the 3D PD pattern is transformed into a 360 × 60 matrix, where the row index indicates the phase angle, the column index indicates the discharge magnitude, and each element of the matrix is the number of discharges. In step 2, the original sparse matrix is compressed into a dense matrix by removing all the zero elements in each row. In step 3, feature vectors of the 3D PD pattern are extracted from the dense matrix. Each feature vector consists of the three key attributes, which are phase angle, discharge magnitude and number of discharges. Thus, the dimension of a feature vector is 3. For example, the first and last feature vectors for the 3D PD pattern shown in Figure 13 are [19, 74, 22] and [301, 15, 25], respectively.

Figure 13. Three steps of data transformation for 3D PD pattern. Step 1: Transformation; Step 2: Compression; Step 3: Extraction.
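The three transformation steps can be sketched as follows. The sketch assumes that each raw discharge is available as an integer pair (phase angle bin, magnitude bin) with zero-based indices; the input format, the function names and the sample events are assumptions made for illustration.

```python
def transform(events, n_phase=360, n_mag=60):
    """Step 1: count the discharges in each (phase angle, magnitude) cell."""
    matrix = [[0] * n_mag for _ in range(n_phase)]
    for phase, magnitude in events:
        matrix[phase][magnitude] += 1
    return matrix

def compress(matrix):
    """Step 2: drop the zero elements, keeping (magnitude, count) pairs per row."""
    return {
        phase: [(mag, count) for mag, count in enumerate(row) if count > 0]
        for phase, row in enumerate(matrix)
        if any(row)
    }

def extract(dense):
    """Step 3: emit [phase angle, discharge magnitude, number of discharges]."""
    return [
        [phase, mag, count]
        for phase, cells in sorted(dense.items())
        for mag, count in cells
    ]

# Hypothetical raw events: (phase angle bin, magnitude bin).
events = [(19, 40), (19, 40), (19, 41), (301, 15), (301, 15), (301, 15)]
features = extract(compress(transform(events)))
print(features)
```

Each non-zero cell of the matrix yields one three-dimensional feature vector, so the number of feature vectors per pattern depends on how widely the discharges are spread.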

3D PD Patterns Characteristics
Four kinds of PD patterns are used in this work: failure in S-phase cable termination, failure in R-phase cable, failure in T-phase cable termination and normal operation. Figure 14 shows the 3D PD pattern of failure in S-phase cable termination. Most of the discharges are between 50 and 70 pC. Figure 15 shows the 3D PD pattern of failure in R-phase cable. Most of the discharges are between 20 and 55 pC, and the phase angle is widely distributed. Figure 16 shows the 3D PD pattern of failure in T-phase cable termination. Most of the discharges are between 10 and 35 pC. Figure 17 shows the 3D PD pattern of normal operation. Most of the discharges are between 10 and 25 pC. After applying the three steps of data transformation to a 3D PD pattern, we obtain the feature vectors of the corresponding 3D PD pattern. Then, Algorithm 1 is utilized to determine the priority of CAs using the training feature vectors to construct the cluster spanning tree.

Figure 17. Normal operation of the equipment.

Experiment Results and Comparison
This work uses data collected by a well-known foundry company in Taiwan. A PD measurement based on the IEC 60270 standard was performed on a 60-MVA cast-resin transformer with a rated voltage of 22.8 kV. Three RF sensors were installed near the surfaces of the power transformer to detect the PD signals. The positions of the RF sensors were adjusted to obtain the same performance. Three phase voltages were obtained from the voltage output. The phase voltage and the three PD signals were connected to a 4-channel oscilloscope to identify where the PD occurs. The R-S-T sensors capture the PD signals and send them to the oscilloscope through three wideband RF cables. The phase voltages were adjusted to measure the PD from the power transformer. Table 1 shows the three attributes used in the PD detection, which are phase angle (φ), discharge magnitude (q) and number of discharges (n). Table 2 lists the four classes of PD patterns, which are failure in S-phase cable termination, failure in R-phase cable, failure in T-phase cable termination and normal operation. Three cable defects were created artificially on the cable prior to the installation of the cable joints. Each PD pattern was measured 40 times. In total, this experiment produced 160 sets of PD patterns, 128 of which are for training and 32 of which are for testing. Each class has 32 training patterns and 8 testing patterns. After the three steps of data transformation, 84,368 feature vectors were used for training and 21,092 feature vectors were used for testing. After applying Algorithm 1, the CA utilized to split the root cluster is the charge pC. Three threshold values, p = 0.5, 0.7 and 0.9, were used in Algorithm 3. The FLCDT was compared with two software packages, See5 and CART. See5 is a data mining tool that extracts informative patterns from data and assembles them into classifiers to make predictions [36]. See5 is developed from C4.5 to operate on large databases and incorporates innovations such as boosting. The classification and regression tree (CART) in the classification toolbox for MATLAB was utilized to compare the accuracy [37]. CART selects the decision split that maximizes the improvement in the Gini index over all possible splits of all predictors.

Table 1. Three attributes used in the PD pattern recognition.

| Notation | Attribute |
|----------|-----------|
| k1 | Phase angle |
| k2 | Charge pC |
| k3 | Cycle number |

Table 2. Four classes of PD patterns.

| Pattern | Class |
|---------|-------|
| 1 | Failure in S-phase cable termination |
| 2 | Failure in R-phase cable |
| 3 | Failure in T-phase cable termination |
| 4 | Normal operation |

Figure 18 shows the cluster spanning tree and the corresponding CA, where a block represents a cluster and the classes are displayed inside the parentheses in each cluster. The CA is listed above the outgoing branch. There are two SCs in the cluster spanning tree for p = 0.5, where each SC consists of two patterns. There are three SCs in the cluster spanning tree for p = 0.7 and 0.9, where $SC_3$ consists of two patterns. Finally, the C4.5 algorithm is applied to $SC_3$ to construct the decision tree. Figure 19 displays the decision tree of $SC_3$, which consists of patterns 3 and 4. Two attributes, phase angle and charge pC, are utilized in the decision tree of $SC_3$. Since the attribute values of the cycle number have a higher overlapping degree, the classes in the data set are not easily separable by this attribute. Thus, the cycle number attribute is never used in the cluster spanning tree or in the decision tree of $SC_3$.

Figure 20 shows the pattern distributions of the 21,092 testing feature vectors. In Figure 20, '○' represents the failure in S-phase cable termination (pattern 1), '□' represents the failure in R-phase cable (pattern 2), 'Δ' represents the failure in T-phase cable termination (pattern 3), and '☆' represents the normal operation of the equipment (pattern 4). From the pattern distributions, it is clear that the three SCs can be separated using the charge pC ($k_2$), and patterns 3 and 4 can be classified using the phase angle ($k_1$) and the charge pC ($k_2$).

The classification precision of FLCDT was compared with the existing software CART and See5. The classification precision is defined as the ratio of the number of correctly classified patterns to the total number of patterns. Table 3 shows the resulting classification precisions of the four patterns, the training time and the classification time. Considering the three threshold values, we found that the case p = 0.5 resulted in a lower classification precision, while the results of the other two cases are the same. Since a larger threshold value p allows a higher overlapping degree, two classes are more easily separable. The classification precisions, training times and classification times obtained by CART and See5 are also shown in Table 3. Test results show that the FLCDT with p = 0.7 and p = 0.9 performs better than CART and See5 in classification precision. The reason is that overfitting arises when the decision trees are directly applied to the training data set. Overfitting happens when a decision tree is excessively dependent on irrelevant features of the training data, so that its predictive ability for untrained data is reduced. For patterns 1 and 4, See5 performs better than CART. Furthermore, the training time required by FLCDT is much shorter than those required by CART and See5. The FLCDT therefore not only achieves a higher classification precision than CART and See5, but also requires less training time. This also reveals that the hierarchical clustering scheme helps reduce the time complexity of the C4.5 algorithm.

Figure 21 shows the confusion matrix of the four patterns. The confusion matrix shows that all the measurements belonging to pattern 1 are classified correctly. For pattern 2, 12.5% of the data measurements are misclassified into pattern 3. In addition, 12.5% of the data measurements known to be in pattern 3 are misclassified into pattern 4. For pattern 4, 12.5% of the data measurements are misclassified into each of patterns 2 and 3. Table 4 shows the classification recall, precision, F-score and the average results of the four patterns using FLCDT with p = 0.7. The overall accuracy of the FLCDT with p = 0.7 is 87.5%. A ROC curve is defined for binary classification and does not extend directly to multi-class problems without a reduction such as one-vs-rest; the ROC-AUC score for the considered problem is therefore not provided in this work.

Figure 21. Confusion matrix of the four patterns (axis label: predicted pattern).
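The reported overall accuracy can be cross-checked from the misclassification percentages quoted above. The sketch below rebuilds a confusion matrix in counts, assuming 8 testing patterns per class so that 12.5% corresponds to one pattern; the matrix is inferred from the text rather than copied from Figure 21, and the per-class values it yields are not claimed to match Table 4.

```python
# Rows: true pattern 1 to 4; columns: predicted pattern 1 to 4 (counts out of 8).
confusion = [
    [8, 0, 0, 0],  # pattern 1: all classified correctly
    [0, 7, 1, 0],  # pattern 2: one test pattern assigned to pattern 3
    [0, 0, 7, 1],  # pattern 3: one test pattern assigned to pattern 4
    [0, 1, 1, 6],  # pattern 4: one each assigned to patterns 2 and 3
]
n = len(confusion)
total = sum(map(sum, confusion))

accuracy = sum(confusion[i][i] for i in range(n)) / total
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
precision = [confusion[i][i] / sum(row[i] for row in confusion) for i in range(n)]
f_score = [2 * p * r / (p + r) for p, r in zip(precision, recall)]

print(f"accuracy = {accuracy:.3f}")
for i in range(n):
    print(f"pattern {i + 1}: recall = {recall[i]:.3f}, "
          f"precision = {precision[i]:.3f}, F-score = {f_score[i]:.3f}")
```

With these inferred counts, 28 of the 32 testing patterns lie on the diagonal, which agrees with the overall accuracy of 87.5% stated in the text.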

Conclusions
PD diagnosis is a useful tool for evaluating the insulation condition of the transformer and preventing possible failures. Classification of different types of PDs is important for diagnosing the quality of high-voltage electrical equipment. In this work, a fuzzy logic clustering decision tree (FLCDT) is proposed to classify the aberrant PD of cast-resin transformers. The proposed method integrates a hierarchical clustering scheme with the decision tree. The FLCDT not only consumes less training time, but also improves the classification precision. PD measurements based on the IEC 60270 standard were performed on a 60-MVA cast-resin transformer with a rated voltage of 22.8 kV.
The test dataset has three continuous attributes and three abnormal defects. Test results demonstrate that the FLCDT performs better than CART and See5 with respect to classification accuracy. Accordingly, the proposed FLCDT can serve as an effective abnormality detection method for cast-resin transformers where real-time processing of data is required. Future research will focus on applying the proposed method to more complicated fault detection problems, such as the incipient winding and core deformations of power transformers, linear induction motors and brushless direct current motors.
