Statistical Feature Extraction Combined with Generalized Discriminant Component Analysis Driven SVM for Fault Diagnosis of HVDC GIS

Abstract: Accurately identifying the types of insulation defects inside a gas-insulated switchgear (GIS) is of great significance for guiding maintenance work and ensuring the safe and stable operation of GIS. By building a 220 kV high-voltage direct current (HVDC) GIS experimental platform and manufacturing four different types of insulation defects (including multiple sizes and positions), 180,828 pulse current signals under multiple voltage levels are successfully measured. Then, the apparent discharge quantity and the discharge time, two inherent physical quantities unaffected by the experimental platform and measurement system, are obtained after the pulse current signal is denoised, from which 70 statistical features are extracted. In this paper, a pattern recognition method based on a generalized discriminant component analysis driven support vector machine (SVM) is detailed and the corresponding selection criterion for the involved parameters is established. The results show that the newly proposed pattern recognition method greatly improves the recognition accuracy of fault diagnosis in comparison with 36 kinds of state-of-the-art dimensionality reduction algorithms and 44 kinds of state-of-the-art classifiers. This newly proposed method not only overcomes the difficulty that phase-resolved partial discharge (PRPD) analysis cannot be applied under DC conditions but also immensely facilitates the fault diagnosis of HVDC GIS.


Introduction
At present, the power grid is developing towards high voltage, large capacity and intensification, and the requirement for power supply reliability is gradually increasing. In this context, the gas-insulated switchgear (GIS), with its fully enclosed structure, is increasingly being used by power grids at different levels [1]. Meanwhile, the increasing use of offshore wind energy, and therefore the increased demand for energy, requires transmission to onshore. For this energy collection, offshore platforms are needed where space is very expensive, and DC GIS offers a solution [2]. In order to improve transmission capacity and power supply reliability, the long-term fault evolution and aging problems of HVDC GIS must be solved urgently, which has important value in theoretical research and engineering applications. The damage an insulation defect does to GIS is closely related to the type of the defect itself. Different types of insulation defects have different fault evolution laws and different influences on the aging of GIS insulating materials, which also result in different maintenance and treatment measures. Therefore, accurately identifying the types of insulation defects inside GIS to perform fault diagnosis is of great significance.
The projection vectors obtained by SODA may be redundant, as projection vectors in the same direction cannot improve the recognition ability. PC-DCA still suffers from the numerical instabilities of MD-FDA caused by the ill-conditioned within-class scatter matrix. Furthermore, there exist some serious fundamental errors and unreasonable aspects regarding BDCA in [4]. Firstly, the division method of signal-subspace and noise-subspace by Professor S. Y. Kung is not universal. Secondly, the assumption that all the eigenvalues of BDCA's discriminant matrix corresponding to the noise-subspace approximate to 1 is incorrect. Thirdly, there exist problems of numerical instabilities in the kernelization form of BDCA given in [4]. Lastly, the theories of BDCA lack rigorous mathematical proofs. All the problems mentioned above will be resolved by the generalized discriminant component analysis (GDCA) and its kernelization forms proposed in this paper.
In addition, research on alternating current (AC) partial discharge (PD) pattern recognition has accounted for the main proportion of work and is very mature [18,33], while DC PD pattern recognition has gradually attracted attention but is still in its infancy and has not formed a unified standard, due to the lack of phase information; existing approaches mainly comprise the Centor score method [34], the chaotic analysis method [35], the NoDi pattern method [36], compressed sensing theory [37], the support vector machine (SVM) [38,39], etc. The main deficiencies in current DC PD pattern recognition are detailed as follows: (a) Some features are extracted from the waveform of the pulse current signal, which is influenced by the specific experimental platform and measurement system.
(b) Most of the related research papers perform PD tests on ideal defect models under some specific voltage level, but the actual GIS operation site may have partial discharges from insulation defects of different voltage levels, types, locations and sizes. Even for the same defect type, the voltage level, defect size and defect location all have considerable impacts on the DC PD pulse. The problem of DC PD pattern recognition for insulation defects with different sizes and locations under different voltage levels still urgently needs an effective solution.
(c) Most of the existing literature verifies the recognition accuracy of the corresponding PD pattern recognition method under the assumption that sufficient experimental data can be obtained from the designed defect models in a laboratory environment. Whether the PD pattern recognition method can still be applied when the number of available discharges to be recognized is relatively small has not been verified.
(d) Besides ensuring recognition accuracy, how to reduce the recognition time as much as possible, so that reasonable measures can be taken as soon as possible to minimize the damage of insulation defects to GIS, should also be researched.
In order to solve the above problems, we built a 220 kV HVDC GIS experimental platform and manufactured four different types of insulation defects (including multiple sizes and locations). For each insulation defect, multiple voltage levels were set, ranging from the beginning of stable discharge to the final breakdown or the highest voltage that the experimental platform can provide, and stepwise-boosting voltages were applied with each voltage level lasting for 1 h. Finally, a total of 180,828 pulse current signals were successfully measured. Then, the apparent discharge quantity and the discharge time, two inherent physical quantities unaffected by the experimental platform and measurement system, were obtained after the pulse current signal was denoised, from which 70 statistical features were extracted. In this paper, a pattern recognition method based on GDCA and its kernelization forms driven SVM is detailed and the corresponding selection criterion for the involved parameters is established. Combining the Monte-Carlo experimental method with the cross-validation test strategy, a wealth of estimation indicators for the classification results are calculated. The results show that the newly proposed pattern recognition method greatly improves the recognition accuracy in comparison with 36 kinds of state-of-the-art dimensionality reduction algorithms and 44 kinds of state-of-the-art classifiers. This newly proposed method not only overcomes the difficulty that phase-resolved partial discharge (PRPD) analysis cannot be applied under DC conditions but also immensely facilitates the fault diagnosis of HVDC GIS.
The subsequent structure of this paper is arranged as follows: Section 2 introduces the GIS experimental platform and insulation defect settings; Section 3 describes the 70 statistical features extracted from the inherent physical quantities of pulse current signal; Section 4 proposes the theories and algorithms of GDCA and its kernelization forms; Section 5 gives the results and discussions of the newly proposed pattern recognition method based on GDCA and its kernelization forms driven SVM; and the paper is concluded in Section 6.

Experimental Platform
The schematic diagram of the experimental platform is shown in Figure 1, consisting of a 220 V AC power supply (powered by a WH38905 ultra-isolation transformer), ABB close switch AS, voltage regulator VR, step-up transformer BT (turns ratio 1:1000), silicon stacks D1 and D2 (rated rectifier current 12 mA, rated inverse peak voltage 200 kV), protection resistor R1 (1.6 MΩ), ZWF200-0.1 DC capacitor C1 (composed of two capacitors of 0.1010 µF and 0.1015 µF in series, both rated at 200 kV), resistor divider (RD, divider ratio 8000:1), multimeter, protection resistor R2 (2.13 MΩ), high-voltage bushing, test sleeve (mainly consisting of HV electrode, insulator, low-voltage (LV) electrode, and insulation support), SF6/N2 gas filling device, signal detection impedance Z1 and contrast detection impedance Z2 (Z1 and Z2 are both RLC type and identical), coupling capacitor Ck (197.8 pF), pulse current amplifier (PCAP), ultrasonic probe (UAP), ultrasonic amplifier (UAA), built-in UHF sensor (BUHFS), two DLM2054 oscilloscopes (highest sampling rate 2.5 GSa/s, bandwidth 500 MHz) and one Agilent DSO-S 254A oscilloscope (highest sampling rate 20 GSa/s, bandwidth 2.6 GHz). Note that only pulse current signals are researched in this paper due to limited space; the other two kinds of signals, the UHF signal and the ultrasonic signal, will be researched in other papers.

Insulation Defects
In this paper, four different types of insulation defects are manufactured, corresponding to solid insulation air gap discharge, surface discharge, floating discharge and point discharge, among which the solid insulation air gap defect adopts a self-made vacuum casting block using bisphenol-A epoxy resin, shown in Figure 2, and the remaining three types of insulation defects are all set on the GIS post insulator shown in Figure 3. In order to take into account the influences of the defect's location and size on the pulse current signal, different defect locations or defect sizes are set for the same type of defect. The details are shown in Table 1.

Statistical Features Extraction from the Inherent Physical Quantities
As stated in Sections 1 and 2, for each insulation defect, multiple voltage levels were set, ranging from the beginning of stable discharge to the final breakdown or the highest voltage that the experimental platform can provide, and stepwise-boosting voltages were applied with each voltage level lasting for 1 h. Finally, a total of 180,828 pulse current signals were successfully measured, consisting of 540 sample points (one sample point comprises the whole discharge data recorded during the 1 h experiment of the specific insulation defect under the corresponding voltage level, containing at least 50 continuous discharge signals). For each single pulse after being denoised, the corresponding apparent discharge quantity and discharge time, two inherent physical quantities unaffected by the experimental platform and measurement system and reflecting the inherent properties of the discharge sources, can be obtained accurately. All the statistical features extracted in this section are based on these two inherent physical quantities. In general, the extraction of statistical features is commonly based on the distribution function (continuous case) or probability distribution (discrete case) of one-dimensional or multi-dimensional random variables, which we extend to PD data modes in this paper. PD data modes refer to the statistical relationship diagrams involving the discharge time, the apparent discharge quantity or their corresponding differences, not necessarily representing probability distributions. The following discussion is focused on a certain discharge sample point.
Assume that the apparent discharge quantity sequence of the current discharge sample point is denoted as Q = {qr | r = 1, 2, ..., SN}, where qr denotes the apparent discharge quantity of the rth single pulse belonging to the sample point and SN denotes the number of pulses in the current discharge sample point; the discharge time sequence is denoted as PDT = {PDTr | r = 1, 2, ..., SN}, where PDTr denotes the discharge time of the rth discharge. With regard to the rth discharge, the forward discharge time interval is Δtpre = PDTr − PDTr−1 and the backward discharge time interval is Δtsuc = PDTr+1 − PDTr; the first-order difference of apparent discharge quantity is Δq = qr − qr−1 and the first-order difference of discharge time interval is Δ(Δt) = PDTr − 2PDTr−1 + PDTr−2. In addition, T denotes the duration of the sample point; n(qr) and fPD(qr) denote the discharge number and discharge repetition rate corresponding to the pulses with apparent discharge quantity equal to qr; U denotes the DC voltage applied across the insulation defect when obtaining the sample point and Us denotes the corresponding partial discharge inception voltage; WPr (r = 1, 2, ..., SN) denotes the energy of the rth discharge and CP denotes the partial discharge cumulative product [34].
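The difference quantities defined above can be computed directly from the two measured sequences. A minimal NumPy sketch (the arrays Q and PDT here are illustrative data, not the paper's measurements):

```python
import numpy as np

# Illustrative data: apparent discharge quantities (pC) and discharge times (s)
Q = np.array([12.0, 15.5, 9.8, 20.1, 14.3])
PDT = np.array([0.01, 0.05, 0.12, 0.20, 0.31])

# Discharge time intervals: the same difference sequence, indexed differently
d = np.diff(PDT)
dt_pre = d                            # Δt_pre of the r-th discharge (r = 2..SN)
dt_suc = d                            # Δt_suc of the r-th discharge (r = 1..SN-1)

# First-order difference of apparent discharge quantity
dq = np.diff(Q)                       # Δq = q_r - q_{r-1}

# First-order difference of discharge time interval:
# Δ(Δt) = PDT_r - 2*PDT_{r-1} + PDT_{r-2}
ddt = PDT[2:] - 2 * PDT[1:-1] + PDT[:-2]
```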
As stated above, a PD data mode does not always represent a probability distribution. For a two-dimensional PD data mode, uniformly expressed as yi = f(xi), it first needs to be analogized to the probability distribution of a discrete random variable X. Let all possible values of X be denoted as x1, x2, ..., xn (x1 < x2 < ... < xn), with corresponding probabilities p1, p2, ..., pn. The transformation formula is shown as Equation (1).
By Equation (1), we can calculate the statistical features of any two-dimensional PD data mode (when the PD data mode is a histogram of a certain random variable) or analogical statistical features (when it is not). The involved features of two-dimensional PD data modes comprise the expectation (denoted as m1), standard deviation (denoted as m2), skewness (denoted as Skewness), kurtosis (denoted as Kurtosis) and the number of peaks (denoted as Peaks).
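Equation (1) is not reproduced in this excerpt; a natural reading is that the ordinate values of the PD data mode are normalized to sum to one so they can be treated like probabilities. Under that assumption, a sketch of the resulting (analogical) statistical features:

```python
import numpy as np

def mode_features(x, y):
    """Analogical statistical features of a 2-D PD data mode y_i = f(x_i).

    The normalization p_i = y_i / sum(y) is an assumption standing in for
    Equation (1) of the paper; m1, m2, Skewness and Kurtosis are then the
    usual moments of the discrete distribution P(X = x_i) = p_i.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    p = y / y.sum()
    m1 = np.sum(p * x)                             # expectation
    m2 = np.sqrt(np.sum(p * (x - m1) ** 2))        # standard deviation
    skew = np.sum(p * (x - m1) ** 3) / m2 ** 3     # skewness
    kurt = np.sum(p * (x - m1) ** 4) / m2 ** 4     # kurtosis
    # Peaks: number of local maxima of the ordinate sequence
    peaks = int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))
    return m1, m2, skew, kurt, peaks
```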
When the two-dimensional PD data mode represents a variable histogram, namely a probability distribution (more precisely, a frequency distribution, i.e., an estimate of the actual probability distribution from experimental data), we can use the Weibull distribution (when the random variable is non-negative) or one-dimensional kernel density estimation [23] to fit the corresponding probability distribution. Using the maximum likelihood estimation method to fit the Weibull distribution, the corresponding scale parameter α and shape parameter β can be obtained. Kernel density estimation is a non-parametric method of estimating the probability density function. Assuming that the univariate probability density function to be estimated is denoted as g, its kernel density estimate is denoted as ĝ in Equation (2), where K is a non-negative kernel function and h is a smoothing parameter, also referred to as the bandwidth. We adopt the adaptive kernel density estimator based on the linear diffusion process proposed in [40] to estimate the optimal smoothing parameter hbest. Similarly, when a three-dimensional PD data mode represents a binary histogram, namely a two-dimensional probability distribution, two-dimensional kernel density estimation can be used to fit the corresponding probability distribution, and finally, two optimal smoothing parameters can be calculated [40], denoted as Hx and Hy. The energy entropy of the binary histogram can also be calculated by Equation (3), denoted as Entropy, where Fij denotes the probability (more precisely, the frequency) corresponding to the (i, j)th binary grid.
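The Weibull fit and the one-dimensional kernel density estimate can be sketched with SciPy. This is only an illustration: the diffusion-based bandwidth selector of [40] is not available in SciPy, so SciPy's default bandwidth rule stands in for hbest, and `dt` is synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dt = rng.weibull(1.5, size=500) * 0.1   # synthetic, non-negative Δt sample

# Weibull fit by maximum likelihood: shape β and scale α (location fixed at 0)
beta, loc, alpha = stats.weibull_min.fit(dt, floc=0)

# 1-D kernel density estimate of ln(Δt); SciPy's default bandwidth rule is a
# stand-in for the diffusion-based optimal bandwidth h_best of [40]
ln_dt = np.log(dt)
kde = stats.gaussian_kde(ln_dt)
h = kde.factor * ln_dt.std(ddof=1)      # effective smoothing parameter
```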
We first introduce each PD data mode used in this paper, labelled from A to L, and then detail the statistical features extracted from the corresponding PD data mode.

D. Hn(ln(Δt))
Hn(ln(Δt)) is the frequency density histogram of ln(Δt). The extracted statistical features consist of m1, m2, Skewness, Kurtosis, Peaks, and the optimal smoothing parameter hbest. In addition, the Weibull distribution can be used to fit the probability distribution function of Δt, which is always non-negative, to obtain the fitting parameters α and β.

E. Hq(CP)
Hq(CP) is the two-dimensional relationship diagram of the PD cumulative product CP [34], calculated by Equation (4), and the apparent discharge quantity q. The extracted statistical features consist of m1, m2, Skewness, and Kurtosis.
Since the discharge time interval Δt is generally not constant, it is necessary to sort all Δt first, and then set an appropriate interval range to divide ln(Δt) at equal intervals in order to construct a PD data mode of q and Δt. Let the total number of intervals be NI, let qn (n = 1, 2, ..., NI) denote the average of all the apparent discharge quantities in the nth interval and let qmax denote the corresponding maximum of all the apparent discharge quantities in the nth interval. Then, we can derive the following four PD data modes, whose extracted statistical features consist of m1, m2, Skewness, Kurtosis and Peaks.
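The binning step described above can be sketched as follows (the number of intervals NI and the data are illustrative):

```python
import numpy as np

def interval_stats(q, dt, NI=10):
    """Bin pulses by ln(Δt) into NI equal-width intervals; per non-empty
    interval return the mean (q_n) and maximum (q_max) of q."""
    q, ln_dt = np.asarray(q, float), np.log(np.asarray(dt, float))
    edges = np.linspace(ln_dt.min(), ln_dt.max(), NI + 1)
    # Map each pulse to its interval index (clip so the maximum lands in bin NI-1)
    idx = np.clip(np.digitize(ln_dt, edges) - 1, 0, NI - 1)
    q_n = np.array([q[idx == n].mean() for n in range(NI) if np.any(idx == n)])
    q_max = np.array([q[idx == n].max() for n in range(NI) if np.any(idx == n)])
    return q_n, q_max
```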
F. Hqn(ln(Δtsuc))
Hqn(ln(Δtsuc)) is the two-dimensional relationship diagram between qn and ln(Δtsuc).

G. Hqmax(ln(Δtsuc))
Hqmax(ln(Δtsuc)) is the two-dimensional relationship diagram between qmax and ln(Δtsuc).

H. Hqn(ln(Δtpre))
Hqn(ln(Δtpre)) is the two-dimensional relationship diagram between qn and ln(Δtpre).

I. Hqmax(ln(Δtpre))
Hqmax(ln(Δtpre)) is the two-dimensional relationship diagram between qmax and ln(Δtpre). Analogous to the AC PD phase distribution pattern, which describes the difference in distribution shape between the two-dimensional relationship diagrams of the positive and negative half cycles of the power frequency period, the above four PD data modes can be used to construct two combination diagrams: one combines Hqn(ln(Δtsuc)) with Hqn(ln(Δtpre)) and the other combines Hqmax(ln(Δtsuc)) with Hqmax(ln(Δtpre)). From the combination diagrams, the extracted statistical features are the cross-correlation factor and the degree of asymmetry, denoted as CC and Asymmetry, respectively. Let y1i (i = 1, 2, ..., n) represent the ordinate values of Hqn(ln(Δtsuc)) or Hqmax(ln(Δtsuc)) and y2i (i = 1, 2, ..., n) represent the ordinate values of Hqn(ln(Δtpre)) or Hqmax(ln(Δtpre)); CC and Asymmetry can be calculated as Equation (5).
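Equation (5) is not reproduced in this excerpt. As a hedged stand-in, the sketch below takes CC as the Pearson correlation of the two ordinate sequences and Asymmetry as the ratio of their means, the forms commonly used for AC PRPD fingerprints; the paper's exact definitions may differ:

```python
import numpy as np

def cc_asymmetry(y1, y2):
    """Hedged forms of CC and Asymmetry for two combined PD data modes.

    CC: Pearson correlation of the ordinate sequences (assumed form).
    Asymmetry: mean(y2) / mean(y1) (assumed form).
    """
    y1, y2 = np.asarray(y1, float), np.asarray(y2, float)
    cc = np.corrcoef(y1, y2)[0, 1]
    asym = y2.mean() / y1.mean()
    return cc, asym
```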
In this paper, binary joint distributions are also used as three-dimensional PD data modes labelled from J to L, which take the apparent discharge quantity q, the discharge time interval ∆t or their corresponding differences as the joint variables, of which the extracted statistical features consist of H x and H y , two optimal smoothing parameters of two-dimensional kernel density estimation, as well as energy entropy Entropy.
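The three-dimensional features Hx, Hy and Entropy could be extracted along the following lines. SciPy's Gaussian KDE with its default bandwidth rule is used as a stand-in for the two-dimensional diffusion estimator of [40], the Shannon form is assumed for the energy entropy of Equation (3), and the data are synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
q = rng.gamma(2.0, 5.0, size=400)                  # synthetic apparent charge
ln_dt = np.log(rng.weibull(1.2, size=400) + 1e-6)  # synthetic ln(Δt)

# Per-axis smoothing parameters from a 2-D Gaussian KDE (stand-ins for Hx, Hy)
kde = stats.gaussian_kde(np.vstack([q, ln_dt]))
Hx, Hy = np.sqrt(np.diag(kde.covariance))          # effective bandwidth per axis

# Energy entropy of the binary histogram (assumed Shannon form of Equation (3))
F, _, _ = np.histogram2d(q, ln_dt, bins=16)
Fij = F / F.sum()                                  # grid frequencies F_ij
Entropy = -np.sum(Fij[Fij > 0] * np.log(Fij[Fij > 0]))
```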

J. Hn(q, ln(Δt))
Hn(q, ln(Δt)) is a binary histogram of the apparent discharge quantity q and the natural logarithm of the discharge time interval ln(Δt), with two cases Hn(q, ln(Δtsuc)) and Hn(q, ln(Δtpre)), illustrated as Figures 4 and 5, respectively, in which the fitting results of two-dimensional kernel density estimation are also given.

K. Hn(Δq, ln(Δt))
Hn(Δq, ln(Δt)) is a binary histogram of the first-order difference of apparent discharge quantity Δq and the natural logarithm of the discharge time interval ln(Δt), with two cases Hn(Δq, ln(Δtsuc)) and Hn(Δq, ln(Δtpre)), illustrated as Figures 6 and 7, respectively, in which the fitting results of two-dimensional kernel density estimation are also given.

L. Hn(Δq, ln|Δ(Δt)|)
Hn(Δq, ln|Δ(Δt)|) is a binary histogram of the first-order difference of apparent discharge quantity Δq and the natural logarithm of the absolute value of the first-order difference of discharge time interval ln|Δ(Δt)|. Considering that Δ(Δt) is not always positive, the two cases Δ(Δt) > 0 and Δ(Δt) < 0 are treated separately, illustrated as Figures 8 and 9, respectively, in which the fitting results of two-dimensional kernel density estimation are also given.

GDCA and Its Kernelization Forms
This section promotes the supervised subspace projection technique from BDCA to GDCA and its kernelization forms. Some indispensable terminologies and symbols should first be introduced [4,23]: Recognition Vector, denoted as x_i ∈ R^(M×1) (i = 1, 2, ..., N), consists of all statistical features of the ith discharge sample point; the corresponding vectors in the intrinsic vector space and the empirical vector space are denoted as in [4,23].
Different from the derivation of BDCA [4], the newly proposed GDCA starts directly from PC-DCA, shown in Equation (6), and improves the corresponding constraint condition, making it more robust and flexible than BDCA. BDCA can then be regarded as a special case of the proposed GDCA under a specific parameter value.
At first, the projection matrix of PC-DCA in Equation (6) is divided into two parts: the signal-subspace projection matrix W_PS and the noise-subspace projection matrix W_PN. Without loss of generality, suppose that the projected dimensionality m is larger than rank(S_B); then Equation (6) can be transformed into Equation (7) according to the signal-subspace and the noise-subspace.
PC-DCA can be promoted to GDCA by improving the constraint W^T S_W W = I of Equation (7) to W^T (S_W + δI)W = I (δ > ρ, δ → 0+ and ρ → 0). Note that the signal-subspace projection matrix is also denoted as W_PS and the noise-subspace projection matrix as W_PN in GDCA, shown as Equation (8). It can be seen from Equation (8) that GDCA degenerates into BDCA when ρ = 0. If we temporarily ignore the constraint W^T S_B W = 0, it can be derived that the discriminant matrix of GDCA is (S_W + δI)^(−1)(S_C + ρI); the signal-subspace projection matrix W_PS consists of the eigenvectors of (S_W + δI)^(−1)(S_C + ρI) corresponding to the rank(S_B) larger eigenvalues, while the noise-subspace projection matrix W_PN consists of the eigenvectors corresponding to the m − rank(S_B) smaller eigenvalues. It is only necessary to further prove that W_PN automatically (approximately) satisfies the constraint W_PN^T S_B W_PN = 0, as follows: let the m − rank(S_B) smaller eigenvalues of (S_W + δI)^(−1)(S_C + ρI) be arranged in ascending order to form a diagonal matrix Σ_PN, so that Equation (9) can be deduced.
Combining the constraint condition W_PN^T (S_W + δI)W_PN = I in Equation (8) with Equation (9), the latter can be equivalently converted to Equation (10), from which Equation (11) can be further obtained.
(1) Calculate the between-class scatter matrix S_B, the within-class scatter matrix S_W, and the center-adjusted scatter matrix S_C: (1.1) use data preprocessing methods, such as standard normal density (SND) or min-max normalization (MMN), to preprocess the original recognition vectors [41]; (1.2) denote the recognition vectors after preprocessing as x_i ∈ R^{M×1} (i = 1, 2, ..., N), then calculate S_B, S_W and S_C.
(2) Calculate the projection matrix W_GDCA: (2.1) if m is not more than rank(S_B), W_GDCA consists of the eigenvectors of (S_W + δI)^−1(S_C + ρI) corresponding to the rank(S_B) larger eigenvalues, arranged in descending order; (2.2) if m is larger than rank(S_B), the signal-subspace projection matrix W_PS consists of the eigenvectors of (S_W + δI)^−1(S_C + ρI) corresponding to the rank(S_B) larger eigenvalues, arranged in descending order, while the noise-subspace projection matrix W_PN consists of the eigenvectors of (S_W + δI)^−1(S_C + ρI) corresponding to the m − rank(S_B) smaller eigenvalues, arranged in ascending order.
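Steps (1) and (2) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names are ours, and S_C is assumed to equal S_B + S_W, as in standard DCA.

```python
import numpy as np

def scatter_matrices(X, y):
    """Between-class S_B, within-class S_W and center-adjusted S_C
    (assumed here to satisfy S_C = S_B + S_W) for rows-as-samples X."""
    mu = X.mean(axis=0)
    M = X.shape[1]
    S_B, S_W = np.zeros((M, M)), np.zeros((M, M))
    for c in np.unique(y):
        Xc = X[y == c]
        d = (Xc.mean(axis=0) - mu)[:, None]
        S_B += len(Xc) * (d @ d.T)
        Z = Xc - Xc.mean(axis=0)
        S_W += Z.T @ Z
    return S_B, S_W, S_B + S_W

def gdca_projection(X, y, m, rho=1e-5, delta=1e-4):
    """Step (2): eigenvectors of (S_W + delta*I)^-1 (S_C + rho*I); the
    rank(S_B) largest eigenvalues give the signal-subspace (descending),
    any remaining columns come from the smallest eigenvalues (ascending)."""
    S_B, S_W, S_C = scatter_matrices(X, y)
    M = X.shape[1]
    D = np.linalg.solve(S_W + delta * np.eye(M), S_C + rho * np.eye(M))
    w, V = np.linalg.eig(D)          # D is not symmetric in general
    w, V = w.real, V.real
    desc = np.argsort(w)[::-1]       # eigenvalue indices, descending
    r = np.linalg.matrix_rank(S_B)
    if m <= r:
        cols = desc[:m]
    else:                            # signal-subspace, then noise-subspace
        cols = np.concatenate([desc[:r], desc[::-1][:m - r]])
    return V[:, cols]
```

Recognition vectors are then projected as X @ W before being fed to the classifier.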

Results and Discussions
In order to demonstrate the advantages of the newly proposed pattern recognition method based on GDCA and its kernelization forms driven SVM, this section first puts forward a test strategy for the recognition effect based on the combination of the Monte-Carlo experimental method and cross-validation, by which a wealth of estimation indicators for the classification results can be calculated. Then, a criterion is given for finding the optimal (α, δ) value-pair for GDCA and the optimal (γ, α, δ) value-pair for GDCA's kernelization forms, through which the parameters involved in GDCA and its kernelization forms can be optimally selected in advance, without using the estimation indicators of the classification results; this greatly shortens the time of pattern recognition while ensuring the optimal recognition effect. Finally, the results and discussions are detailed.

Test Strategy
First, random sampling is performed on the uniform distribution, so that the recognition vectors of all the discharge sample points are equally divided into five disjoint folds, and then 5-fold cross-validation is performed. Furthermore, the estimation indicators of each fold are calculated separately and the results of the 5 folds are averaged. The above process can be regarded as one Monte-Carlo experiment. In order to reduce the impact of the randomness of the data-set division on the estimation indicators of the final classification result, the above process is repeated 10 times, i.e., 10 Monte-Carlo experiments are performed. Finally, the results of the 10 Monte-Carlo experiments are averaged. The whole test strategy is shown in Figure 10. The Binary-SVM presented in Figure 10 adopts the two-class support vector classification machine based on soft constraints, also referred to as Binary C-SVC [4]. Let x_i (i = 1, 2, ..., N) denote the PD recognition vector corresponding to the ith sample point input to the Binary-SVM, and let y_i (i = 1, 2, ..., N), equal only to 1 or −1, denote the class label of the ith sample point. It is worth noting that x_i (i = 1, 2, ..., N) can be the recognition vectors in the original vector space or after being projected by means of GDCA or its kernelization forms. After solving the corresponding quadratic programming problem, we can obtain the decision function f(x) with regard to any PD recognition vector x.
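The test strategy above can be sketched as follows (a minimal illustration; a trivial nearest-centroid classifier stands in for the Multiclass-SVM, and all function names are ours):

```python
import numpy as np

def monte_carlo_cv(X, y, fit_predict, n_repeats=10, n_folds=5, seed=0):
    """Test strategy of Figure 10: repeat (shuffle -> 5 disjoint folds ->
    5-fold cross-validation -> average) n_repeats times, then average the
    per-repeat results. fit_predict(Xtr, ytr, Xte) is any classifier."""
    rng = np.random.default_rng(seed)
    repeat_scores = []
    for _ in range(n_repeats):                     # one Monte-Carlo experiment
        idx = rng.permutation(len(X))
        folds = np.array_split(idx, n_folds)       # disjoint, near-equal folds
        fold_scores = []
        for k in range(n_folds):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            pred = fit_predict(X[train], y[train], X[test])
            fold_scores.append(np.mean(pred == y[test]))
        repeat_scores.append(np.mean(fold_scores))  # average over the 5 folds
    return np.mean(repeat_scores), np.std(repeat_scores)

def nearest_centroid(Xtr, ytr, Xte):
    """Trivial stand-in classifier: assign each test point to the class
    with the nearest training-set centroid."""
    classes = np.unique(ytr)
    C = np.stack([Xtr[ytr == c].mean(axis=0) for c in classes])
    return classes[np.argmin(((Xte[:, None] - C) ** 2).sum(-1), axis=1)]
```

The returned standard deviation over the 10 repetitions is the robustness measure referred to in the results below.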
Since the kernel matrix is a dense matrix in general and may be too large to store, Professor Chih-Jen Lin et al. [42] developed the LIBSVM toolbox, which is widely used for solving classification and regression problems due to its convenience of adjusting parameters, adopting an SMO-type decomposition method [43,44] to solve the quadratic programming problem.
The above-mentioned Binary-SVM is only specified for the two-class situation, and there mainly exist two kinds of methods to extend the Binary-SVM to a Multiclass-SVM, namely the one-versus-one scheme and the one-versus-all scheme. The one-versus-one scheme needs to train one Binary-SVM for each possible pair of the CN classes, which results in CN(CN − 1)/2 Binary-SVMs. The one-versus-all scheme consists of CN Binary-SVMs, each of which is trained for one class against all the other classes. This paper adopts the one-versus-one scheme. However, the basic Binary-SVM can only obtain the decision value of a test sample. In order to obtain posterior class probabilities, we first adopt Equation (13) to convert the decision values output by the Binary-SVM into the estimated pairwise class probabilities r_ij (i ≠ j and i, j = 1, 2, ..., CN), where the parameters A and B can be obtained by solving the regularized maximum likelihood problem of maximizing the log-likelihood function in Equation (14), a kind of relative entropy or Kullback-Leibler divergence [45]. In Equation (14), t_i denotes the maximum a posteriori (MAP) estimation of the target probability shown in Equation (15), which consists of two values, namely t_+ and t_−, corresponding to the positive and negative samples, respectively. Compared with t_+ = 1 and t_− = 0, Equation (15) can effectively avoid the overfitting of Equation (13).
Figure 10. Test strategy of the newly proposed pattern recognition by combining the Monte-Carlo experiment with cross-validation.
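The sigmoid fit of Equations (13)-(15) can be sketched as follows. This is a minimal stand-in that uses plain gradient descent instead of the Newton-type iteration used in LIBSVM; the function name `platt_fit` and the step-size/iteration defaults are ours:

```python
import numpy as np

def platt_fit(f, y, n_iter=5000, lr=0.01):
    """Fit A, B of Equation (13), P(y=1|f) = 1/(1 + exp(A*f + B)), by
    minimizing the negative log-likelihood of Equation (14) against the
    MAP targets of Equation (15): t+ = (N+ + 1)/(N+ + 2), t- = 1/(N- + 2)."""
    f, y = np.asarray(f, float), np.asarray(y)
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    t = np.where(y == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))
    A, B = 0.0, np.log((n_neg + 1.0) / (n_pos + 1.0))  # Platt's initial point
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))
        gz = t - p                  # gradient of the NLL w.r.t. z = A*f + B
        A -= lr * np.dot(gz, f)
        B -= lr * np.sum(gz)
    return A, B
```

Because the targets of Equation (15) never reach 0 or 1, the optimum (A, B) stays finite, which is exactly the overfitting protection discussed above.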
In addition, in order to ensure the unbiasedness of the decision values used to estimate the parameters A and B, all the decision values in Equation (13) are obtained through 5-fold cross-validation of the training data set: the decision function is first obtained with the 4-fold samples, then the decision values of the remaining 1-fold samples are calculated, and the above process is repeated until each training sample has a decision value. For a sample x_t to be classified, we first obtain its estimated pairwise class probabilities r_ij (i ≠ j and i, j = 1, 2, ..., CN); then the optimization problem based on the pairwise coupling method in Equation (16) [46] is solved to obtain the final class probabilities p_k^t = P(y_t = k | x_t) (k = 1, 2, ..., CN) of the Multiclass-SVM.
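The pairwise coupling step of Equation (16) can be sketched by solving the KKT linear system of the underlying quadratic program directly (a simplification of the iterative method of [46]; the function name `pairwise_coupling` and the final nonnegativity clip are ours):

```python
import numpy as np

def pairwise_coupling(R):
    """Given pairwise probability estimates R (R[i, j] ~ P(class i | class
    i or j), with R[i, j] + R[j, i] = 1 for i != j), solve
        min_p sum_i sum_{j != i} (R[j, i]*p[i] - R[i, j]*p[j])^2
        s.t. sum(p) = 1
    via the KKT linear system of the quadratic program."""
    k = R.shape[0]
    Q = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                Q[i, i] = sum(R[j2, i] ** 2 for j2 in range(k) if j2 != i)
            else:
                Q[i, j] = -R[j, i] * R[i, j]
    # stack the equality constraint sum(p) = 1 with multiplier b
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = Q
    A[:k, k] = 1.0
    A[k, :k] = 1.0
    rhs = np.zeros(k + 1)
    rhs[k] = 1.0
    sol = np.linalg.solve(A, rhs)
    p = np.clip(sol[:k], 0.0, None)   # guard against tiny negative values
    return p / p.sum()
```

When the r_ij are mutually consistent (i.e., generated from some true p), the objective is zero at that p and the solver recovers it exactly.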
Finally, according to the maximum posterior probability criterion, the class that maximizes p_k^t is taken as the predicted class of the test sample. The above pattern recognition can achieve the Bayes-optimal decision under the condition of equal costs.
It can be seen from Figure 10 that ten estimation indicators are used to evaluate the recognition results [41,47]. By extending the two-class estimation indicators of pattern recognition to the multi-class situation using the class ratio as the weight, each 1-fold estimation indicator can be obtained. Then, each 5-fold estimation indicator can be obtained by averaging the corresponding results of all folds.
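The class-ratio weighting described above can be sketched as follows, with two-class recall (sensitivity) as an example indicator; the function names are ours:

```python
import numpy as np

def binary_recall(t, p):
    """Two-class sensitivity: TP / (TP + FN), on boolean masks."""
    return np.sum(t & p) / max(np.sum(t), 1)

def weighted_multiclass(y_true, y_pred, binary_metric):
    """Extend a two-class estimation indicator to the multi-class case:
    evaluate it one-versus-rest for each class, then average the per-class
    values using the class ratio as the weight."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, counts = np.unique(y_true, return_counts=True)
    weights = counts / counts.sum()
    vals = [binary_metric(y_true == c, y_pred == c) for c in classes]
    return float(np.dot(weights, vals))
```

Any of the ten two-class indicators can be passed in place of `binary_recall`.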

Criterion for Selecting Optimal Parameters of GDCA and Its Kernelization Forms
In the actual application of GDCA or its kernelization forms, the optimal (α, δ) or (γ, α, δ) value-pair should be determined according to a certain criterion in order to establish the final pattern recognition algorithm, as is also shown in Figure 10. In this section, we establish a criterion, denoted criterion f_m, as shown in Equation (17), which integrates three technical indicators, namely SNR_i (i = 1, 2, ..., rank(S_B)), OSNR_m and WOSNR_m [4]. In Equation (17), h_i (i = 1, 2, 3) denote the weights, and N_a, N_b and N_c denote the number of calculated values of γ, α and δ, respectively.
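The criterion-driven selection can be sketched generically as follows; `criterion_fm` is a hypothetical stand-in for Equation (17), whose exact weights h_i are not fixed here:

```python
import numpy as np
from itertools import product

def select_parameters(criterion_fm, alphas, deltas):
    """Exhaustively evaluate the criterion over the grid of (alpha, delta)
    value-pairs and return the maximizing pair, so that the parameters of
    GDCA are fixed before any classifier is trained.  criterion_fm(a, d)
    is assumed to return a scalar combining SNR_i, OSNR_m and WOSNR_m
    with weights h_i as in Equation (17)."""
    best, best_pair = -np.inf, None
    for a, d in product(alphas, deltas):
        v = criterion_fm(a, d)
        if v > best:
            best, best_pair = v, (a, d)
    return best_pair
```

For the kernelized forms, the same loop runs over (γ, α, δ) triples instead of pairs.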

Recognition Effects of GDCA and Its Kernelization Forms Driven SVM
This section gives the recognition effects of GDCA and its kernelization forms driven SVM. Many comparisons are performed, and the effects of different values of the regularization coefficient α on GDCA and its kernelization forms driven SVM are also investigated.

Recognition Effect of GDCA Driven SVM
According to the flowchart shown in Figure 10, there exist values of δ with which GDCA driven SVM outperforms the original SVM with regard to all the estimation indicators except MASpecificity, under all the values of α. All the standard deviations of the 10 Monte-Carlo experiments for each estimation indicator obtained by GDCA driven SVM are less than those of the original SVM, which means that GDCA driven SVM is more robust than the original SVM. In addition, by maximizing criterion f_10 established in Section 5.2, it is derived that α_opt = 0.9 and δ_opt = 10^−4, under which MACA reaches its maximum of 92.41%. The time consumed in executing all the Monte-Carlo experiments by GDCA driven SVM has a mean of 403.9399 s and a standard deviation of 53.3158 s, while the time consumed by the original SVM is 1697.9388 s. The above times were measured using MATLAB on a personal computer with a 2.50 GHz CPU and 16.0 GB RAM.
The increased percentage and the sign of the estimation indicators comparing GDCA driven SVM with BDCA driven SVM under different values of α are shown in Figure 12 (only the case of α = 0.5 is displayed due to limited space). The results show that the range of δ in which GDCA outperforms BDCA with regard to all the estimation indicators generally expands as α increases. In particular, GDCA (0 < α < 1) is superior to BDCA over the majority of the span of δ.


Recognition Effect of GDCA's Kernelization Forms Driven SVM
According to the flowchart shown in Figure 10, the corresponding results are shown in Figure 13 (only the cases under α = 0.9 are displayed due to limited space). The results show that there exist large numbers of combinations of (γ, α, δ) under which KGDCA-Intrinsic-Space/KGDCA-Empirical-Space driven SVM is superior to the original SVM. In addition, the range of γ in which KGDCA-Intrinsic-Space/KGDCA-Empirical-Space outperforms the original SVM generally expands as δ decreases. Furthermore, by maximizing criterion f_10 established in Section 5.2, it is derived that γ_opt = 2^−3, α_opt = 0.9 and δ_opt = 10^−5 for KGDCA-Intrinsic-Space driven SVM, under which MACA reaches its maximum of 100%, meaning that all the test samples of all the Monte-Carlo experiments are classified successfully; for KGDCA-Empirical-Space driven SVM, γ_opt = 2^−1, α_opt = 0.9 and δ_opt = 10^−6, under which MACA reaches 99.83%.

Comparisons with Other Dimensionality Reduction Algorithms
In this section, we use the test strategy in Section 5.1 to compare the newly proposed method with 36 kinds of state-of-the-art dimensionality reduction algorithms. Although we calculate 10 estimation indicators to evaluate the recognition results, only the results for MACA are shown in Figure 16 due to limited space. It can be seen from the results that the proposed method outperforms all the compared feature selection methods, comprising the filter, wrapper and embedded types, since the proposed method uses all the information contained in the recognition vectors, whereas feature selection methods inevitably discard some useful information. Meanwhile, the wrapper and embedded types are always classifier-dependent, and are not only computationally intensive but also at risk of overfitting. In addition, the proposed method also outperforms all the compared unsupervised subspace projection methods, which can be attributed to the fact that unsupervised subspace projection technology does not involve any class information. Even compared with supervised subspace projection technologies, the proposed method still demonstrates significant advantages.


Comparisons with Other Classifiers
In this section, we use the test strategy in Section 5.1 to compare the newly proposed method with other state-of-the-art classifiers adopted in mainstream pattern recognition methods, composed of ten kinds of neural networks [48], the classical rough set (CRS) [49], the neighborhood classifier (NEC) [50], the K-nearest-neighbor classifier (KNN), the classification and regression tree (CART), C4.5, the Naive Bayes classifier (NBC), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), kernelized discriminant analysis (KDA) [4] and ensemble algorithms (including bootstrap aggregating and random subspace ensembles) [51], as well as their combinations with feature selection (also referred to as attribute reduction) using the neighborhood rough set (NRS) [50]. All the parameters involved in the above methods are chosen according to 5-fold cross-validation.

Comparisons with CRS
CRS is specified for discrete features, so it is indispensable to properly discretize the continuous features before using CRS. Eight discretization methods were chosen, comprising four unsupervised algorithms based on hierarchical clustering (HRC), K-means, the Gaussian mixture model (GMM) [48] and fuzzy C-means clustering (FCM) [52], together with four supervised algorithms based on ChiMerge, information entropy (IFE), the class-attribute contingency coefficient (CACC) and class-attribute interdependence maximization (CAIM) [53]. The corresponding results are shown in Figure 17b, from which it can be seen that both GDCA driven SVM and GDCA's kernelization forms driven SVM are superior to CRS with all eight discretization methods.
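As an illustration, one of the unsupervised options, K-means discretization of a single continuous feature, can be sketched in pure NumPy (our own minimal 1-D implementation, initialized at quantiles):

```python
import numpy as np

def kmeans_discretize(x, n_bins=4, n_iter=50):
    """Unsupervised 1-D K-means discretization: cluster one continuous
    feature and replace each value with the index of its cluster
    (clusters are kept sorted by centre, so labels are ordinal)."""
    x = np.asarray(x, float)
    centres = np.quantile(x, np.linspace(0.0, 1.0, n_bins))  # spread init
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
        for k in range(n_bins):
            if np.any(labels == k):               # skip empty clusters
                centres[k] = x[labels == k].mean()
        centres = np.sort(centres)
    return np.argmin(np.abs(x[:, None] - centres[None, :]), axis=1)
```

Applied column-by-column, this converts the continuous statistical features into the discrete attributes CRS requires.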

Conclusions
By building a set of 220 kV HVDC GIS experiment platforms and manufacturing four different types of insulation defects (including multiple sizes and positions), we successfully measured 180,828 pulse current signals under multiple voltage levels. After the signals were denoised, the apparent discharge quantity and the discharge time, two inherent physical quantities unaffected by the experimental platform and measurement system, were obtained, from which 70 statistical features were extracted. We detailed a pattern recognition method based on generalized discriminant component analysis and its kernelization forms driven SVM, and established the corresponding selection criterion for the involved parameters. Combining the Monte-Carlo experimental method with the cross-validation test strategy, 10 evaluation indicators for the classification results were calculated. Then, the recognition effects of GDCA and its kernelization forms driven SVM, including comparisons between each other, were analyzed in detail. Finally, comparisons between the newly proposed pattern recognition method and 36 kinds of state-of-the-art dimensionality reduction algorithms, together with 44 kinds of state-of-the-art classifiers, were performed. The following conclusions can be drawn: (1) All the problems of BDCA mentioned in Section 1 can be resolved by GDCA as well as its kernelization forms proposed in this paper. The range of δ in which GDCA outperforms BDCA with regard to all the estimation indicators generally expands as α increases; in particular, GDCA (0 < α < 1) is superior to BDCA over the majority of the span of δ. In the overwhelming majority of combinations of γ and δ under all the values of α, KGDCA-Intrinsic-Space/KGDCA-Empirical-Space outperformed GDCA.
(2) By establishing an effective criterion to optimally select the parameters involved in GDCA and its kernelization forms in advance, without using the evaluation indicators of the classification results, the time of pattern recognition can be shortened considerably while the optimal recognition effect is ensured. Because only the apparent discharge quantity and the discharge time, two inherent physical quantities unaffected by the experimental platform and measurement system, are needed, this newly proposed method not only overcomes the difficulty that phase-resolved partial discharge (PRPD) analysis cannot be applied under DC conditions, but also immensely facilitates the fault diagnosis of HVDC GIS.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
The following gives a rigorous mathematical proof of the proposition that the GDCA algorithm meets the SNR criterion in the signal-subspace and the noise-power criterion in the noise-subspace. The whole proof consists of Propositions A1, A2 and A3, as follows.
Proposition A1: The rank(S_B) larger eigenvalues of GDCA's discriminant matrix (S_W + δI)^−1(S_C + ρI) are larger than 1, the other M − rank(S_B) eigenvalues are smaller than 1, and all the eigenvalues are at least equal to ρ/δ.

Proof of Proposition A1
Firstly, we prove that the rank(S_B) larger eigenvalues of GDCA's discriminant matrix (S_W + δI)^−1(S_C + ρI) are larger than 1, and that the other M − rank(S_B) eigenvalues are smaller than 1.
Since S_W is a real-symmetric positive semi-definite matrix and δ → 0^+, S_W + δI must be a real-symmetric positive definite matrix. According to the Cholesky decomposition theorem, S_W + δI can be expressed as the product of a lower triangular matrix L_WI, whose diagonal elements are all positive, and its transpose, as shown in Equation (A1). Let Σ_GDCA be a diagonal matrix whose diagonal elements consist of the rank(S_B) larger eigenvalues of (S_W + δI)^−1(S_C + ρI) arranged in descending order and the M − rank(S_B) smaller eigenvalues arranged in ascending order. Then, Equation (A2) can be derived (with m = M here).
We further define a matrix H_I as in Equation (A3). Secondly, we prove that all the eigenvalues of (S_W + δI)^−1(S_C + ρI) are not less than ρ/δ. The proof proceeds by contradiction, as follows: assume that there exists a real λ smaller than ρ/δ such that λ is an eigenvalue of (S_W + δI)^−1(S_C + ρI) corresponding to the eigenvector ν. Then, Equation (A5) can be derived. It can be deduced from ρ < δ and λ < ρ/δ that 1 − λ > 0 and ρ − λδ > 0. Because both S_B and S_W are real-symmetric positive semi-definite matrices and (ρ − λδ)I is a real-symmetric positive definite matrix, S_B + (1 − λ)S_W + (ρ − λδ)I is actually a real-symmetric positive definite matrix, which contradicts Equation (A5). Therefore, all the eigenvalues of (S_W + δI)^−1(S_C + ρI) are not less than ρ/δ. In particular, when S_B + (1 − ρ/δ)S_W is a singular matrix, the smallest eigenvalue of (S_W + δI)^−1(S_C + ρI) is exactly equal to ρ/δ.
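The statement just proved can also be checked numerically. The sketch below assumes S_C = S_B + S_W, consistent with the derivation above, and uses random positive semi-definite stand-in matrices rather than experimental data:

```python
import numpy as np

def gdca_eigenvalues(S_B, S_W, rho, delta):
    """Eigenvalues of the GDCA discriminant matrix (S_W + dI)^-1 (S_C + rI),
    with S_C assumed equal to S_B + S_W, sorted in descending order."""
    M = S_B.shape[0]
    D = np.linalg.solve(S_W + delta * np.eye(M), S_B + S_W + rho * np.eye(M))
    return np.sort(np.linalg.eigvals(D).real)[::-1]

# random positive semi-definite pair with rank(S_B) = 3 < M = 6
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
S_B = A @ A.T                      # PSD, rank 3
G = rng.standard_normal((6, 6))
S_W = G @ G.T                      # PSD, full rank
rho, delta = 1e-5, 1e-4
lam = gdca_eigenvalues(S_B, S_W, rho, delta)
```

With these matrices, exactly rank(S_B) = 3 eigenvalues exceed 1, the remaining three fall below 1, and all stay above ρ/δ = 0.1, exactly as Proposition A1 asserts.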
It can be seen from Equation (8) that w_i^T [S_B + (ρ − δ)I] w_i = (λ_i − 1) w_i^T (S_W + δI) w_i for i = 1, 2, ..., rank(S_B) (A7). Let Λ_GDCA be a diagonal matrix whose diagonal elements consist of all the eigenvalues µ_i (i = 1, 2, ..., M) of S_W + δI arranged in descending order. Since S_W is a real-symmetric positive semi-definite matrix and in general has at least one positive eigenvalue, µ_1 ≥ µ_2 ≥ ... ≥ µ_M ≥ δ > 0 and there exists i ∈ [1, M] such that µ_i is larger than δ. Therefore, Equation (A11) can be derived from Equation (A10).
Because λ_1 ≥ λ_2 ≥ ... ≥ λ_rank(S_B) > 1, it can be seen from Equations (A12) and (A13) that the SNRs of the projected components obtained by the signal-subspace projection matrix W_PS are all larger than zero and arranged in descending order.
Proposition A3: The noise powers of projected components obtained by the noise-subspace projection matrix W PN are arranged in ascending order.
It can be seen from Equation (8) and Equation (A19) that the projection vectors should be normalized by their own 2-norm in order to avoid the influence of the projection vectors' norms on the noise power. On account of the fact that δ > ρ and ρ/δ ≤ λ_M ≤ λ_{M−1} ≤ ... ≤ λ_{M−m+rank(S_B)+1} < 1, it can be seen from Equation (A19) that the noise powers of the projected components obtained by the noise-subspace projection matrix W_PN are non-negative and arranged in ascending order.