Transmission Line Fault-Cause Identification Based on Hierarchical Multiview Feature Selection

Abstract: Fault-cause identification plays a significant role in transmission line maintenance and fault disposal. With the increasing types of monitoring data, e.g., micrometeorology and geographic information, multiview learning can be used to realize information fusion for better fault-cause identification. To reduce the redundant information in different types of monitoring data, this paper proposes a hierarchical multiview feature selection (HMVFS) method that addresses the challenge of combining waveform and contextual fault features. To enhance the discriminant ability of the model, an ε-dragging technique is introduced to enlarge the boundary between different classes. To effectively select the useful feature subset, two regularization terms, namely the $\ell_{2,1}$-norm and Frobenius norm penalties, are adopted to conduct hierarchical feature selection for multiview data. Subsequently, an iterative optimization algorithm is developed to solve the proposed model, and its convergence is theoretically proven. Waveform and contextual features are extracted from field data and used to evaluate the proposed HMVFS. The experimental results demonstrate the effectiveness of the combined use of fault features and reveal the superior performance and application potential of HMVFS.


Introduction
Transmission lines cover a wide area and work in diverse outdoor environments to achieve long-distance, high-capacity power transmission. In order to maintain stable power supply, high-speed fault diagnosis is indispensable for line maintenance and fault disposal.
Traditional fault diagnosis technologies concerning fault detecting, fault locating, and phase selection are well developed [1,2], while diagnosis on external causes is still underdeveloped. Operation crews attach great importance to fault location for line patrol and manual inspection. However, on-site inspection is labor-intensive and depends on subjective judgment. Moreover, cause identification after inspection is too late for dispatchers to give better instructions according to the external cause, such as forced energization. Fault-cause identification is expected to help dispatch and maintenance personnel make a proper and speedy fault response.
Transmission line faults are most often triggered by external factors due to environmental change or surrounding activities. Though the cause categories differ slightly between regions or institutions, the common causes can be listed as lightning, tree, animal contact, fire, icing, pollution and external damage [3]. Considering the complexity and variability of open-air operation, it is hard to model fault scenarios for diverse root causes [4,5]. Thus, existing studies on line fault-cause identification have been developed based on data-driven methods rather than physical modeling.
The early identification methods were rule-based, such as statistical analysis, CN2 rule induction [6] and fuzzy inference system (FIS) [7][8][9]. Their identification frameworks are finally presented in the form of logic flow, demanding a great degree of robustness and generality for their rules or thresholds. In recent years, various machine learning (ML) techniques that attach great importance to hand-crafted features have been applied to diagnose external causes [10][11][12][13][14], such as logistic regression (LR), artificial neural network (ANN), k-nearest neighbor (KNN) and support vector machine (SVM). Deep learning (DL) provides a more efficient way in the field of fault identification. In [15], deep belief network (DBN) is used as the classification algorithm after extracting time-frequency characteristics from traveling wave data. Even when using DL methods, feature engineering is still an inevitable part to achieve high accuracy.
Feature signature study provides knowledge about fault information and plays a critical role in fault-cause identification. On the one hand, when fault events happen, power quality monitors (PQMs) give easy access to electrical signals and time stamps [16]. Time-domain features extracted from fault waveforms and time stamps were used to construct a logic flow to classify lightning-, animal- and tree-induced faults [6]. To exploit transient characteristics in the frequency domain, signal processing techniques such as wavelet transform (WT) and empirical mode decomposition (EMD) are used for further waveform characteristic analysis [17][18][19][20]. In [21], a fault waveform was characterized in the time and frequency domains to develop an identification logic. However, a fault waveform is easily affected by the system operation state, and there is no direct connection between these characteristics and external causes. On the other hand, weather condition is directly relevant to many fault-cause categories such as lightning, icing and wind. With the development of monitoring equipment and communication technology, dispatchers can now make judgments with more and more outdoor information [22]. These nonwaveform characteristics, such as time stamps, environment attributes and other textual data, are called contextual characteristics in this paper. Table 1 lists and compares the characterization and classification methods in existing works.

Table 1. Classification methods in existing works.
Reference | Classification method
[6] | CN2
Liang, Li [7] | FIS
* Xu, Chow [8][9][10] | FIS/LR/ANN
* Cai, Chow [11] | LR
Chang, Hong [12] | SVM
* Jiang, Liu [14] | KNN
Liang, Liu [15] | DBN
Asman, Aziz [20] | decision tree
* Qin, Wang [21] | logic flow
* Dehbozorgi, Rastegar [22] | decision tree
Minnaar, Nicolls [23] | KNN
Articles marked with * concern faults on distribution networks, but their work is still inspiring for transmission networks.
Studies have shown that waveform and contextual features can each achieve high accuracy without the other, but the data requirements are high. For economic and operational reasons, data conditions will not change significantly in the short term, so it is necessary to study performance improvement for fault-cause identification under current data conditions. One of the challenges is determining how to combine waveform features and multisource contextual features. This is an information fusion problem, and the simplest approach is feature concatenation. The authors of [23] combined contextual features and waveform features into a mixed vector but found that the concatenated features reduced performance. Moreover, most existing work focuses on one side only; few studies use both waveform and contextual characteristics for higher classification performance.
To tackle the fusion challenge, multiview learning (MVL) is introduced in this paper because waveform and contextual features describe the same fault event from different views. MVL aims to integrate multiple-view data properly and overcome biases between views to obtain satisfactory performance. One typical MVL method is canonical correlation analysis (CCA), which maps multiview features into a common feature space [24]. Instead of mapping features, multiview feature selection, which selects features from each view, is preferred in fault-cause identification. Unlike traditional feature selection, multiview feature selection treats multiview data as inherently related and ensures that complementary information from other views is exploited [25,26]. In [27], a review on real-time power system data analytics with wavelet transform is given. Discrete wavelet transform was used to identify high impedance faults and heavy load conditions [28]. The authors of [29] propose a fault diagnosis approach for the main drive chain in a wind turbine based on data fusion. To deal with multivariable fault diagnosis problems in which input variables need to be adjusted for different typical faults, a deep autoencoder model is adopted to train the fault diagnosis model for different typical fault types.
In this paper, we propose a hierarchical multiview feature selection (HMVFS) method for transmission line fault-cause identification. The two view datasets are composed of the waveform features and the contextual features. The proposed HMVFS is applied to conduct feature selection for the optimal feature combination. In our model, to enhance the discriminant ability of regression, an ε-dragging technique is used to enlarge the margin between classes. Next, two regularization terms, namely the $\ell_{2,1}$-norm and Frobenius norm (F-norm) penalties, are adopted to perform hierarchical feature selection. Here, the $\ell_{2,1}$-norm realizes row sparsity to remove the unimportant features of each view, and the F-norm realizes view-level sparsity to reduce the diversity between the two views. Hence, these two penalties can be viewed as low-level and high-level feature selection, respectively. Finally, fault-cause identification is carried out using ML classifiers and the integrated features. The contributions of this paper are highlighted as follows:

• To the best of our knowledge, this is the first time that multiview learning is introduced for transmission line fault-cause identification in view of the nature of multiview fault data.
• We propose a novel approach, HMVFS, based on ε-dragging and two regularization terms to select the discriminative features across views. We also develop an iterative algorithm to solve the optimization problem and prove its convergence theoretically.
• The performance of HMVFS is evaluated on field data and compared with classical feature selection methods. Experimental results demonstrate the effectiveness of combining waveform and contextual features and the feasibility and superiority of HMVFS.
The rest of this paper is organized as follows: Section 2 presents the proposed HMVFS algorithm and its convergence analysis. Section 3 outlines the real-life line fault dataset and extracts features in terms of waveform and nonwaveform. The empirical study is provided and discussed in Section 4. Section 5 presents concluding remarks.

Notation
Sparsity-based multiview feature selection can be formulated as an optimization problem expressed through loss functions and regularization terms. Before introducing our formulation, the notation is stated.
Matrices are denoted by boldface uppercase letters, and vectors are denoted by boldface lowercase letters. Given the original feature matrix $X = [x_1, x_2, \ldots, x_n]^T \in \mathbb{R}^{n \times d}$, each row of which corresponds to a fault instance, n is the total number of instances and d denotes the number of features. $X^{(v)} \in \mathbb{R}^{n \times d^{(v)}}$ and $x_i^{(v)} \in \mathbb{R}^{d^{(v)}}$ denote a feature matrix and a feature vector in the vth view. There are two views in this paper; thus, $X = [X^{(1)}, X^{(2)}]$. Suppose there are c categories; the label matrix is then represented as $Y = [y_1, y_2, \ldots, y_n]^T \in \{0, 1\}^{n \times c}$. The weight matrix W is partitioned accordingly as $W = [W^{(1)}; W^{(2)}] \in \mathbb{R}^{d \times c}$, with $W^{(v)} \in \mathbb{R}^{d^{(v)} \times c}$.

The Objective Function
Given the notation above and a fault dataset (X, Y), the problem of HMVFS is transformed into determining the weight matrix W and then ranking features for selection. We formulate the optimization problem as

$$\min_{W, M} \; \Psi(W, M) + \alpha \|W\|_{2,1} + \beta \sum_{v=1}^{m} \|W^{(v)}\|_F \quad \text{s.t. } M \ge 0, \tag{1}$$

where m is the number of views; m = 2 in this paper. In this formulation, Ψ(W, M) is the loss function that measures the regression error and is derived from the least square loss function. Furthermore, ε-dragging is introduced to drag the binary outputs in Y away along two opposite directions: the outputs for positive digits become $1 + \varepsilon_i$ and the outputs for negative digits become $-\varepsilon_i$, where all εs are nonnegative. This treatment enlarges the distance between data points from different classes and helps to develop a compact optimization model for classification [30]. $B \in \{-1, 1\}^{n \times c}$ in the formulation is a constant matrix, and its element $B_{ij}$ is defined as

$$B_{ij} = \begin{cases} +1, & Y_{ij} = 1, \\ -1, & Y_{ij} = 0. \end{cases} \tag{2}$$

$B_{ij}$ denotes the dragging direction for the elements in the label matrix Y. $M \in \mathbb{R}^{n \times c}$ is a nonnegative matrix that records all εs. The operator ⊗ is the Hadamard product of matrices. Thus, B ⊗ M represents the dragging distance, and the new label matrix after ε-dragging is

$$\tilde{Y} = Y + B \otimes M. \tag{3}$$

With the least square loss function defined as

$$\|XW - Y\|_F^2, \tag{4}$$

we obtain our loss function

$$\Psi(W, M) = \|XW - Y - B \otimes M\|_F^2. \tag{5}$$
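As a small illustration of the ε-dragging step, the sketch below builds the direction matrix B from a one-hot label matrix and forms the dragged targets Y + B ⊗ M. The function names and the constant dragging distance of 0.25 are hypothetical choices for the example, not values from the paper.

```python
import numpy as np

def dragging_direction(Y):
    """B[i, j] = +1 where Y[i, j] == 1 (positive digit), otherwise -1."""
    return np.where(Y == 1, 1.0, -1.0)

def dragged_labels(Y, M):
    """Relaxed regression targets Y + B * M (elementwise product), with M >= 0."""
    if np.any(M < 0):
        raise ValueError("M must be nonnegative")
    return Y + dragging_direction(Y) * M

# Two samples, three classes; every entry is dragged by 0.25 (illustrative value).
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
M = np.full_like(Y, 0.25)
T = dragged_labels(Y, M)
```

Positive entries move from 1 to 1.25 and negative entries from 0 to −0.25, so the gap between the two target values widens from 1 to 1.5.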
Next, the regularization terms used in the formulation are the $\ell_{2,1}$-norm and the F-norm, which take row-wise feature selection and view-wise feature selection into account.
The $\ell_{2,1}$-norm measures the distance of features as a whole and forces the weights of unimportant features to be assigned small values so that it can perform feature selection among all features. Similarly, the F-norm, which measures the distance between views, forces the weights of unimportant views to be assigned small values [31]. The weight matrix W is regulated by these penalty terms, and hierarchical feature selection is completed with row-wise and view-wise selection. The $\ell_{2,1}$-norm penalty corresponds to low-level feature selection, and the F-norm penalty corresponds to high-level feature selection. Therefore, the objective function of the HMVFS model is obtained and represented as (1). α and β are nonnegative constants that tune the hierarchical feature selection. This model is also applicable to more than two views.
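To make the two penalties concrete, the following sketch evaluates them with their standard definitions: the ℓ2,1-norm as the sum of the ℓ2-norms of the rows of W, and the view-wise term as the sum of the Frobenius norms of the per-view row blocks of W. The helper names and the toy matrix are illustrative assumptions.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the rows of W (row-wise sparsity penalty)."""
    return float(np.sum(np.linalg.norm(W, axis=1)))

def viewwise_f_norm(W, view_sizes):
    """Sum of the Frobenius norms of the row blocks W^(v), one block per view."""
    total, start = 0.0, 0
    for d_v in view_sizes:
        total += np.linalg.norm(W[start:start + d_v])  # Frobenius norm for 2-D input
        start += d_v
    return float(total)

# Three features, two classes; view 1 holds feature 0, view 2 holds features 1 and 2.
W = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.0, 1.0]])
row_penalty = l21_norm(W)                 # 5 + 1 + 1
view_penalty = viewwise_f_norm(W, (1, 2))  # 5 + sqrt(2)
```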

Optimization
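In order to solve the $\ell_{2,1}$-norm and F-norm minimization problems, the regularization terms $\|W\|_{2,1}$ and $\sum_{v=1}^{m} \|W^{(v)}\|_F$ are respectively relaxed by $\mathrm{Tr}(W^T C W)$ and $\mathrm{Tr}(W^T D W)$ [32]. The objective function is rewritten as

$$\min_{W, M} \; \|XW - Y - B \otimes M\|_F^2 + \alpha\, \mathrm{Tr}(W^T C W) + \beta\, \mathrm{Tr}(W^T D W) \quad \text{s.t. } M \ge 0, \tag{8}$$

where $C \in \mathbb{R}^{d \times d}$ and $D \in \mathbb{R}^{d \times d}$ are diagonal matrices derived from W: $C_{ii} = 1 / (2\|w^i\|_2)$ for the ith row $w^i$ of W, and $D_{ii} = 1 / (2\|W^{(v)}\|_F)$ for every feature i belonging to the vth view.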
Though two more variables are introduced, we obtain a convex function, and we can solve the optimization problem iteratively. In each iteration, we update one variable while the others are fixed, so that all variables are optimized in turn. Since C and D are derived from W, we first fix M and update W. The derivative of (8) w.r.t. W is calculated as

$$2 X^T (XW - Y - B \otimes M) + 2\alpha C W + 2\beta D W. \tag{9}$$

Setting (9) to zero, the updated W is obtained by solving the resulting linear equation. For large-scale or high-dimensional data, the gradient descent method is recommended instead. Following that, C and D can be updated.
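The W-step can be sketched as a single linear solve. The code assumes the relaxed objective has the form ‖XW − Y − B ⊗ M‖²_F + α Tr(WᵀCW) + β Tr(WᵀDW), so that the stationarity condition reads (XᵀX + αC + βD) W = Xᵀ(Y + B ⊗ M); the function name, the random data and the placeholder diagonal weights are illustrative assumptions.

```python
import numpy as np

def update_W(X, Y, B, M, C, D, alpha, beta):
    """Solve (X^T X + alpha*C + beta*D) W = X^T (Y + B*M) for W, with M, C, D fixed."""
    A = X.T @ X + alpha * C + beta * D
    return np.linalg.solve(A, X.T @ (Y + B * M))

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))                  # 20 samples, 4 features
Y = np.eye(2)[rng.integers(0, 2, size=20)]    # one-hot labels, 2 classes
B = np.where(Y == 1, 1.0, -1.0)
M = np.zeros_like(Y)
C = 0.5 * np.eye(4)                           # placeholder reweighting matrices
D = 0.5 * np.eye(4)
alpha, beta = 1.0, 0.1

W = update_W(X, Y, B, M, C, D, alpha, beta)
# Gradient of the assumed relaxed objective at the solution; it should vanish.
grad = 2 * X.T @ (X @ W - Y - B * M) + 2 * alpha * C @ W + 2 * beta * D @ W
```

Using `np.linalg.solve` avoids forming an explicit inverse; for very large d, an iterative solver or the gradient descent alternative mentioned above would be the natural substitute.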
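When it turns to M, with W fixed, the optimization problem is transformed from (8) to

$$\min_{M} \; \|XW - Y - B \otimes M\|_F^2 \quad \text{s.t. } M \ge 0. \tag{10}$$

According to the definition of the F-norm, this problem can be decoupled into n × c subproblems [30], each represented as

$$\min_{M_{ij}} \; \left((XW - Y)_{ij} - B_{ij} M_{ij}\right)^2 \quad \text{s.t. } M_{ij} \ge 0. \tag{11}$$

With $B_{ij}^2 = 1$, (11) is equivalent to

$$\min_{M_{ij}} \; \left(B_{ij} (XW - Y)_{ij} - M_{ij}\right)^2 \quad \text{s.t. } M_{ij} \ge 0. \tag{12}$$

With the nonnegative constraint, $M_{ij}$ is calculated as

$$M_{ij} = \max\left(B_{ij} (XW - Y)_{ij},\; 0\right). \tag{13}$$

Accordingly, M can be updated as

$$M = \max\left(B \otimes (XW - Y),\; 0\right). \tag{14}$$

Up to now, all variables have been updated in the iteration, and we present the optimization process in Algorithm 1.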
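The M-step has an elementwise closed form, following the ε-dragging solution in [30]: each entry is the positive part of the residual XW − Y projected on the dragging direction. A minimal sketch with a hand-checkable toy input (the function name and numbers are illustrative):

```python
import numpy as np

def update_M(X, W, Y, B):
    """Elementwise solution M = max(B * (XW - Y), 0) of the nonnegative subproblems."""
    return np.maximum(B * (X @ W - Y), 0.0)

X = np.eye(2)                          # identity features, so XW = W
W = np.array([[1.5, -0.5],
              [0.2,  0.4]])
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
B = np.where(Y == 1, 1.0, -1.0)
M = update_M(X, W, Y, B)
```

In the first row the predictions already overshoot in the dragging direction (1.5 > 1 and −0.5 < 0), so both entries are dragged by 0.5; in the second row the predictions fall on the wrong side of the targets, so the dragging distance stays at zero.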
After optimization, we obtain the weight matrix W learned across all views and then sort all features according to their importance. The importance is measured by the $\ell_2$-norm of each row vector of W, $\|w^i\|_2$ (i = 1, 2, ..., d). Feature selection is completed by ranking the features in descending order of importance.
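The ranking step can be sketched in a few lines; the function name and the toy weight matrix are illustrative.

```python
import numpy as np

def rank_features(W, k=None):
    """Return feature indices sorted by descending row l2-norm of W (top k if given)."""
    scores = np.linalg.norm(W, axis=1)
    order = np.argsort(-scores, kind="stable")
    return order if k is None else order[:k]

W = np.array([[0.1,  0.0],
              [0.9, -0.4],
              [0.0,  0.5]])
ranking = rank_features(W)       # feature 1 first, then 2, then 0
top_two = rank_features(W, k=2)
```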

Convergence
In this subsection, we analyze the convergence of Algorithm 1. We need to guarantee the objective function decreases in each iteration of the optimization algorithm. The following lemma is used to verify its convergence.

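Lemma 1. For any nonzero values a, b ∈ R, the following inequality holds:

$$|a| - \frac{a^2}{2|b|} \le |b| - \frac{b^2}{2|b|}. \tag{15}$$

Theorem 1. The objective function (1) monotonically decreases in each iteration of Algorithm 1.

Proof. According to Step 6 and Step 7 in Algorithm 1, we have $W_{t+1}$ and $M_{t+1}$ as follows:

$$W_{t+1} = \arg\min_{W} \; \|XW - Y - B \otimes M_t\|_F^2 + \alpha\, \mathrm{Tr}(W^T C_t W) + \beta\, \mathrm{Tr}(W^T D_t W), \tag{16}$$

$$M_{t+1} = \arg\min_{M \ge 0} \; \|X W_{t+1} - Y - B \otimes M\|_F^2. \tag{17}$$

Firstly, according to (16) and (17), there is

$$\Psi(W_{t+1}, M_{t+1}) + \alpha\, \mathrm{Tr}(W_{t+1}^T C_t W_{t+1}) + \beta\, \mathrm{Tr}(W_{t+1}^T D_t W_{t+1}) \le \Psi(W_t, M_t) + \alpha\, \mathrm{Tr}(W_t^T C_t W_t) + \beta\, \mathrm{Tr}(W_t^T D_t W_t). \tag{18}$$

Thus, according to the definition of C, we have

$$\mathrm{Tr}(W_{t+1}^T C_t W_{t+1}) = \sum_{i=1}^{d} \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2}.$$

We also perform the same transformation with $\mathrm{Tr}(W_{t+1}^T D_t W_{t+1})$, $\mathrm{Tr}(W_t^T C_t W_t)$ and $\mathrm{Tr}(W_t^T D_t W_t)$. We can rewrite (18) as

$$\Psi(W_{t+1}, M_{t+1}) + \alpha \sum_{i=1}^{d} \frac{\|w_{t+1}^i\|_2^2}{2\|w_t^i\|_2} + \beta \sum_{v=1}^{m} \frac{\|W_{t+1}^{(v)}\|_F^2}{2\|W_t^{(v)}\|_F} \le \Psi(W_t, M_t) + \alpha \sum_{i=1}^{d} \frac{\|w_t^i\|_2^2}{2\|w_t^i\|_2} + \beta \sum_{v=1}^{m} \frac{\|W_t^{(v)}\|_F^2}{2\|W_t^{(v)}\|_F}.$$

According to Lemma 1, applied to each row norm and each view-block norm, we arrive at

$$\Psi(W_{t+1}, M_{t+1}) + \alpha \|W_{t+1}\|_{2,1} + \beta \sum_{v=1}^{m} \|W_{t+1}^{(v)}\|_F \le \Psi(W_t, M_t) + \alpha \|W_t\|_{2,1} + \beta \sum_{v=1}^{m} \|W_t^{(v)}\|_F.$$

Thus, Algorithm 1 decreases the objective in (1) in each iteration, so (1) will converge to its global optimum according to its convexity.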

Algorithm 1 The optimization algorithm for (8)
Input: The feature matrix across all views, $X \in \mathbb{R}^{n \times d}$; the label matrix, $Y \in \{0, 1\}^{n \times c}$; the parameters α and β
Output: The weight matrix across all views, $W \in \mathbb{R}^{d \times c}$
1: Calculate B from Y via (2)
2: Initialize $W_0$ and $M_0$
3: Initialize t = 0
4: Repeat
5:   Calculate $C_t$ and $D_t$ from $W_t$
6:   Update $W_{t+1}$ by setting (9) to zero with $M_t$ fixed
7:   Update $M_{t+1}$ via (14) with $W_{t+1}$ fixed
8:   t = t + 1
9:   Calculate residue via (1)
10: Until convergence or maximum iteration number achieved
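The loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the loss is taken as ‖XW − Y − B ⊗ M‖²_F, the reweighting matrices use the common choices C_ii = 1/(2‖wⁱ‖₂) and D_ii = 1/(2‖W⁽ᵛ⁾‖_F), W is initialized with a ridge solution, and `eps`, `tol` and the synthetic data are hypothetical.

```python
import numpy as np

def objective(X, Y, B, W, M, bounds, alpha, beta):
    """Dragged least-squares loss plus the l2,1 and view-wise F-norm penalties."""
    loss = np.linalg.norm(X @ W - Y - B * M) ** 2
    l21 = np.linalg.norm(W, axis=1).sum()
    views = sum(np.linalg.norm(W[a:b]) for a, b in zip(bounds[:-1], bounds[1:]))
    return loss + alpha * l21 + beta * views

def hmvfs_fit(X, Y, view_sizes, alpha=1.0, beta=1.0,
              max_iter=100, tol=1e-6, eps=1e-8):
    d = X.shape[1]
    assert sum(view_sizes) == d, "view sizes must add up to the feature count"
    bounds = np.concatenate(([0], np.cumsum(view_sizes)))
    B = np.where(Y == 1, 1.0, -1.0)                       # step 1: dragging directions
    W = np.linalg.solve(X.T @ X + np.eye(d), X.T @ Y)     # step 2: ridge start (assumption)
    M = np.zeros(Y.shape)
    history = [objective(X, Y, B, W, M, bounds, alpha, beta)]
    for _ in range(max_iter):
        # step 5: diagonal reweighting terms from the current W (floored by eps)
        c = 1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps))
        dv = np.empty(d)
        for a, b in zip(bounds[:-1], bounds[1:]):
            dv[a:b] = 1.0 / (2.0 * max(np.linalg.norm(W[a:b]), eps))
        # step 6: closed-form W update with M fixed
        A = X.T @ X + np.diag(alpha * c + beta * dv)
        W = np.linalg.solve(A, X.T @ (Y + B * M))
        # step 7: elementwise M update with W fixed
        M = np.maximum(B * (X @ W - Y), 0.0)
        # step 9: track the objective and stop when it no longer improves
        history.append(objective(X, Y, B, W, M, bounds, alpha, beta))
        if history[-2] - history[-1] <= tol * max(1.0, history[-2]):
            break
    return W, M, history

# Synthetic example: 60 samples, a 4-feature view and a 2-feature view, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
Y = np.eye(3)[rng.integers(0, 3, size=60)]
W, M, history = hmvfs_fit(X, Y, view_sizes=(4, 2), alpha=1.0, beta=1.0)
```

The recorded `history` should be non-increasing, which gives a quick empirical check of the monotonic decrease claimed for the algorithm.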

Data Collection and Cleaning
In this study, the fault data were collected from an AC transmission network located in a coastal populous city in Guangdong Province, China. These faults occurred between 2016 and 2019, and the voltage levels varied from 110 to 500 kV. Fault signals were recorded by digital fault recorders (DFRs) installed on substations. The DFR equipment involves PMUs and computer systems to synchronize, store and display analog data for voltage and current signals. These signals can be remotely accessed through a communication network and provide offline data stored in common format for transient data exchange (COMTRADE). The sampling rate is 5 kHz in the dataset. Environmental information and other associated monitoring data were obtained through the inner maintenance system. A patrol report of manual inspection was attached to each fault, describing the inspection result and labeling its cause. The original dataset comprised 551 samples, and 288 of them remained after cleansing. The distribution of fault-cause categories is shown in Figure 1. Lightning, external force and object contact are the three dominant causes. External force refers to collision or damage due to human activity. Object contact is usually caused by floating objects in the air. These are typical causes in a densely populated city, causing more than 90% of known faults.

Waveform Characteristics
It is believed that the disturbance of electrical quantities after a fault contains important transient information for fault diagnosis [33]. The original waveform data are recorded in COMTRADE files with a sampling frequency of 5 kHz. The first step is to acquire fault segments and extract valid waveform segments without the disturbance caused by tripping. In this paper, the beginning of a valid segment is determined by inspection thresholds based on the root mean squared (rms) current magnitude I: the segment starts when dI ≥ 0.15 pu or I ≥ 1.2 pu, where dI is the difference between consecutive values.
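The start-of-segment rule can be sketched as a simple scan over the per-unit rms current. The function name and sample values are hypothetical, and dI is taken here as the signed increase between consecutive samples, which is an assumption about the intended definition.

```python
def find_fault_start(i_rms_pu, d_thresh=0.15, i_thresh=1.2):
    """Index of the first sample with dI >= d_thresh or I >= i_thresh, else None."""
    for k, current in enumerate(i_rms_pu):
        # The first sample has no predecessor, so only the magnitude test applies.
        d_i = current - i_rms_pu[k - 1] if k > 0 else 0.0
        if d_i >= d_thresh or current >= i_thresh:
            return k
    return None

# Prefault current near 1 pu, then a jump at index 3.
rms = [1.00, 1.01, 1.02, 1.30, 2.40, 2.50]
start = find_fault_start(rms)
no_fault = find_fault_start([1.00, 1.01, 1.00])
```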
The start thresholds are determined by inspection to make sure that fault measurements in this study are correctly captured. Since COMTRADE stores not only electrical signals in analog channels but also tripping information in digital channels, one and a half cycles after tripping enabling signal is regarded as the end of the segment. In characterization, we extend previous research work on waveform characterization. The following waveform features are considered and extracted.

1. Maximum Change of Sequence Components: Instantaneous magnitude is calculated relative to the prefault amplitude in order to be compatible with measurements from different voltage levels and operating conditions. Karenbauer transformation is used to obtain the zero, positive and negative components of three-phase signals, denoted by s, s = 0, 1, 2.
2. Maximum Rate of Change of Sequence Components.
3. Sequence Component Values at t-cycle: t is set to 0, 0.5, 1 and 1.5. For instance, t = 0.5 means the measuring point is 1/2 cycle from the start.
4. Custom Time Constant of Sequence Current: Inspired by linear time-invariant systems, a time constant is introduced to reflect the dynamic response of the network [23]. The time constant is the time required to rise from the zero point to 1/e of the maximum current. In this study, 1/e is replaced with a custom value, m. These features are denoted as $TC\_I_s(m)$, m = 0.1, 0.2, ..., 0.9, 1.
5. DC and Harmonic Content: Hilbert-Huang transform is used to conduct spectrum analysis [17]. The harmonic content and DC content are calculated as the ratio of the specific component to the fundamental component. DC and harmonic content are denoted as Har_k, k = 0, 3, 5, 7, 9, 11.
6. Wavelet Energy and Energy Entropy: Discrete wavelet transform is applied to decompose fault-phase current signals into three wavelet scales. Wavelet energy E and energy entropy S are calculated for each scale, where $C_j$, $E_j$ and $p_j$ denote the wavelet coefficient, wavelet energy and relative energy in scale j, j = 1, 2, 3.
7. Maximum DC Current: Equation (30) is used to calculate the maximum DC current on three-phase signals. $N_s$ is the number of data points in one cycle, and n = 0 means the triggering point.
8. Time Domain Factors: Form factor, crest factor, skewness and kurtosis, denoted as $t_1$ to $t_4$, respectively, are introduced to reflect the characteristics of waveform shape and the shock for fault-phase current signals. SD denotes their standard deviation.
9. Approximation Constants δ for Neutral Waveform: In order to learn more from the front wave, the waveform of the rms neutral voltage/current is approximated by (32), as introduced in [33], where t is the time step and δ is the approximation constant.

All waveform features are listed in Table 2. Faulted phase features are included in the next subsection.
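As an illustration of the wavelet energy feature, the sketch below decomposes a signal into three scales and computes the energy of each scale, the relative energies and a Shannon-type energy entropy. Several choices are assumptions made only for this example: the Haar wavelet, the energy defined as the sum of squared detail coefficients, the entropy aggregated over scales as −Σ p_j ln p_j, and the synthetic test signal.

```python
import math

def haar_dwt(signal):
    """One level of the orthonormal Haar DWT; returns (approximation, detail)."""
    if len(signal) % 2:
        signal = signal[:-1]  # drop the last sample so pairs line up
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def wavelet_energy_entropy(signal, levels=3):
    """Per-scale detail energies E_j, relative energies p_j and entropy S."""
    energies, approx = [], list(signal)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        energies.append(sum(c * c for c in detail))
    total = sum(energies)
    if total == 0.0:
        return energies, [0.0] * levels, 0.0
    rel = [e / total for e in energies]
    entropy = -sum(p * math.log(p) for p in rel if p > 0.0)
    return energies, rel, entropy

# 50 Hz component sampled at 5 kHz, with a 1.2 kHz burst in the second half.
fs = 5000.0
signal = [math.sin(2 * math.pi * 50 * k / fs)
          + (0.3 * math.sin(2 * math.pi * 1200 * k / fs) if k >= 100 else 0.0)
          for k in range(200)]
energies, rel, entropy = wavelet_energy_entropy(signal)
```

A dedicated wavelet library and a smoother mother wavelet would be the usual choice in practice; the hand-rolled Haar step is used here only to keep the example dependency-free.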

Contextual Characteristics
Most monitoring technologies are developed for specified causes and work independently with interconnected data. In this study, due to data restrictions, the available nonwaveform data include time stamps, meteorological data, geographical data, protection data and query information. These informative values are preprocessed and integrated into the pool of candidate contextual features, as shown in Table 2. Considering that there is no accurate discretization standard, we only discretize text data roughly if necessary. The time stamp information is discretized twice, based on season and on day/night, as a contrast to months and time of day. As for dynamic records such as meteorological values, the records closest to the fault time are retained. Protection data are the feedback information of protection devices after a fault, usually obtained from the production management system. Although these collected data are related to fault events, not all of them are suitable for fault-cause identification. The irrelevant features pose a great challenge in feature selection.
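The double discretization of time stamps can be sketched as below. The season boundaries (meteorological seasons of the Northern Hemisphere) and the 06:00 to 18:00 daytime window are assumptions chosen for illustration; the paper does not state the cut points it uses.

```python
from datetime import datetime

# Assumed month-to-season mapping (meteorological seasons, Northern Hemisphere).
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "autumn", 10: "autumn", 11: "autumn"}

def discretize_timestamp(ts, day_start=6, day_end=18):
    """Return fine (month, hour) and coarse (season, day/night) time features."""
    return {
        "month": ts.month,
        "season": SEASONS[ts.month],
        "hour": ts.hour,
        "period": "day" if day_start <= ts.hour < day_end else "night",
    }

features = discretize_timestamp(datetime(2018, 7, 14, 22, 30))
```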

Experiment Setup
To validate the effectiveness and efficiency of HMVFS, we conducted comparison experiments using the field data described previously. Three strategies for utilizing multiview data with feature selection were considered, namely single-view learning, feature concatenation after selection and feature selection after concatenation. The last two are the simplest early fusion methods. Single-view learning is represented by the best single view (BSV) method, in which the most informative view achieves the best performance among views. For the dataset in this paper, contextual features are more representative than hand-crafted waveform features. Feature concatenation after selection (FSFC) applies a feature selection technique to each view separately and concatenates the features selected from the different views. Feature selection after concatenation (FCFS) concatenates the original feature sets of the two views and then performs feature selection. Adaptable feature selection methods listed in the next subsection are applied to select discriminative features.
The fault dataset was split into training data and testing data in a stratified fashion according to the ratio of 3:1. All samples were normalized by standard deviation after zero-mean standardization. Then, feature selection methods were used to seek the optimal feature combination using training sets and transform all samples for fault-cause classification. ML classifiers were utilized to finish the classification. In the presence of imbalanced data, criteria such as G-mean and accuracy were used to quantitatively assess classification performance. Since G-mean is a metric within biclass concepts, its microaverage was computed and adopted. The final results of each metric were calculated as the average of the 5 trials.
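One way to compute a micro-averaged G-mean for a multiclass problem is to pool the one-vs-rest confusion counts over all classes and then take the geometric mean of the pooled sensitivity and specificity. The sketch below follows that reading, which is an assumption about the exact averaging used in the experiments; the labels are illustrative.

```python
import math

def micro_gmean(y_true, y_pred):
    """G-mean from one-vs-rest TP/FN/TN/FP counts pooled over all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    tp = fn = tn = fp = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if t == c and p == c:
                tp += 1
            elif t == c:
                fn += 1
            elif p == c:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)

y_true = ["lightning", "lightning", "external", "object", "object", "lightning"]
y_pred = ["lightning", "external", "external", "object", "lightning", "lightning"]
score = micro_gmean(y_true, y_pred)
```

For this toy input the pooled counts are TP = 4, FN = 2, FP = 2 and TN = 10, so the score is the square root of (4/6) × (10/12).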

Comparison Feature Algorithms
As reviewed in [34], there are many feature selection methods. We conducted comparison experiments between our HMVFS and several typical feature selection algorithms, namely Fisher score (F-Score), mutual information (MI), joint mutual information (JMI), joint mutual information maximization (JMIM), ReliefF, Hilbert-Schmidt independence criterion lasso (HSIC Lasso) [35] and recursive feature elimination (RFE). F-Score ranks features through variance similarity calculation, and the same rank can be obtained by analysis of variance (ANOVA). MI ranks features according to the values of their mutual information with class labels. JMI and JMIM are developed from MI [36]. RFE ranks and discards features after training a certain kind of classifier. Starting from all features, the elimination process continues until the feature number or output error reaches a minimum.
The above algorithms are developed for single-view learning and can be used in BSV, FCFS and FSFC directly. Except for RFE, all of them are filter feature selection approaches, as is HMVFS. Besides, the comparison algorithms designed for multiview learning are kernel canonical correlation analysis (KCCA) [24] and discriminant correlation analysis (DCA) [37]. These feature extraction approaches map multiview data into a common feature space, so their results are attached to the comparison in FCFS. As for the proposed algorithm, there are two hyperparameters in HMVFS. In the experiments, the hyperparameters α and β were tuned over $\{10^{-2}, 10^{-1}, 1, 10, 10^{2}, 10^{3}\}$ through grid search on the training sets. Moreover, experiments without any feature algorithm were conducted using BSV features and all features, labeled as RAW_BSV and RAW.

Overall Classification Performance
In this subsection, we compare the mentioned dimension reduction approaches on the basis of SVM to verify the effectiveness of multiview learning and HMVFS. Two concatenating rules were applied to FSFC. The first rule tries to keep a 1:1 proportion of waveform and contextual features; there is one more contextual feature when the total number is odd. The second rule keeps the same proportion of waveform and contextual features as that in HMVFS.
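The two FSFC concatenating rules amount to deciding how many features each view contributes for a given total k. A minimal sketch, assuming waveform features occupy the first indices of the concatenated feature vector and that the HMVFS ranking is available as a list of indices (both hypothetical conventions for this example):

```python
def fsfc_split_equal(k):
    """Rule 1: 1:1 split, with the extra feature going to the contextual view."""
    n_context = (k + 1) // 2
    return k - n_context, n_context

def fsfc_split_like_hmvfs(hmvfs_ranking, n_waveform, k):
    """Rule 2: reuse the per-view counts found among the top-k HMVFS features."""
    top_k = hmvfs_ranking[:k]
    n_wave = sum(1 for idx in top_k if idx < n_waveform)
    return n_wave, len(top_k) - n_wave

# Indices 0-4 are waveform features, 5-9 contextual ones (illustrative ranking).
ranking = [7, 0, 9, 3, 8, 1]
equal_counts = fsfc_split_equal(7)
hmvfs_counts = fsfc_split_like_hmvfs(ranking, n_waveform=5, k=5)
```

Each pair is (waveform count, contextual count); the per-view selector then supplies that many top-ranked features from each view before concatenation.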
The results in terms of G-mean with different numbers of selected features are shown in Figure 2. By comparing single-view feature selection methods among strategies, we notice that most of them perform best in BSV rather than in FSFC and FCFS. Added fault features from the other view can even degrade their classification, which indicates that simple concatenation cannot help conventional feature selection methods adapt to multiview classification. A similar conclusion is drawn in [23]. Thus, the introduction of MVL appears particularly important. HMVFS has comprehensive advantages over FSFC and FCFS and achieves the best performance compared with the methods in BSV. HMVFS outperforms the others in the middle range of feature numbers, and its result with 14 selected features is the global or near-global optimum. When features from the other view increase, the performance is degraded to a certain extent, and then it rises to another peak. Most methods in BSV produce a zigzag rising curve and reach their best when almost all view features are selected. They are also inferior to HMVFS in FSFC and FCFS. ReliefF is the best competitor and achieves acceptable performance in different strategies. As for KCCA and DCA, their performance is low. Figure 2 illustrates that HMVFS is more capable of obtaining the best performance when combining waveform and contextual features. Due to the limits of field data conditions and fault signature study, irrelevant and redundant features are introduced with increasing feature numbers. This problem is more prominent in the waveform view in both theoretical and experimental studies. The advantage of HMVFS is that it selects features with independent and complementary information from all views, while the single-view methods are easily affected by irrelevant features when facing concatenated features or are restricted by the limitation of single-view features.
As seen from Figure 2, concatenating and mapping fail to select or transform discriminative features when waveform and contextual features are combined. There are two local optima for HMVFS, and both are better than the performance of the competitors, which demonstrates that HMVFS overcomes the negative effect of redundant features in multiview data.

Parameter Sensitivity
Determination of hyperparameters is an open problem for many algorithms. We conducted a parameter sensitivity study by testing different settings of the parameters α and β. Since these parameters help HMVFS perform hierarchical feature selection, HMVFS is expected to be sensitive to parameter changes, and this study may reveal a hierarchical feature relationship. The candidate set was $\{10^{-2}, 10^{-1}, 1, 10, 10^{2}, 10^{3}\}$ for each parameter. Classification performance and average running time are recorded and illustrated in Figure 3. It is observed that α = 10 is beneficial to the final selection and maintains relatively high classification performance, within which lower β has a slight advantage. View importance differs in multiview learning. From the perspective of view importance, when only two views exist and one of them is generally better, acceptable performance can be achieved by one view, and additional features are expected to bring improvement. High-level feature selection is weak because the other view has relatively more redundant features and will be ignored with higher β. Meanwhile, an appropriately higher α enhances low-level feature selection to exploit the most representative features from the unimportant view. Moreover, acceptable performance is achieved with $\alpha = 10^{-2}$, $\beta = 10^{2}$ and with $\alpha = 10^{-1}$, $\beta = 10^{2}$. In these settings, high-level selection is enhanced and low-level selection is restrained, which results in limited performance approximating that of single-view learning and in a short convergence time.

Comparison between ML Classifiers
In order to investigate the effect of classifiers and explore better identification accuracy, we employed different ML learners to complete fault-cause classification with HMVFS. Owing to space limitations and for performance stability, F_Score and ReliefF were used for comparison. The typical individual classifiers CN2, LR, KNN, SVM and ANN, which have been proven effective in fault-cause identification studies, were tested, and the results are presented in this subsection. Ensemble models promote fault-cause identification by combining individual learners [22], so we also explored the performance of various ensemble models, including random forest (RF), AdaBoost, stacking ensemble and dynamic ensemble. META-DES, DES-Clustering and KNORA-U are dynamic ensemble techniques based on metalearning, clustering and k-nearest neighbors, respectively. Classification models were developed using the Python machine learning libraries scikit-learn and DESlib. Table 3 presents the best performance for each combination of feature selection method and classifier. Considering that some data may be similar, AUC is introduced as a supplementary criterion; it is derived from receiver operating characteristic (ROC) analysis and calculated as the area under the ROC curve. As seen from the table, HMVFS outperforms F_Score and ReliefF except with LR and ANN. It is observed that HMVFS always takes fewer features to achieve the best performance in the remaining comparisons. In the group of RF, the best scores of F_Score, ReliefF and HMVFS are very close to each other because RF has the ability of variable selection. Thus, the features that function in the final classification are similar if the selected feature subsets are large enough to contain the valuable features. Except for the mentioned learners, HMVFS has advantages in both score and feature number.
From the perspective of learners, the classification performance improves with the enhancement of model complexity. CN2 as a rule-based learner cannot cope with multiview features to achieve acceptable performance. Individual learners cannot achieve accuracies greater than 0.8, which are apparently inferior to most ensemble models. Among ensemble models, stacking ensemble realizes the best fault-cause identification in this study. The experimental results of ML classifiers indicate that HMVFS is more suitable for classifiers with high generalization and that ensemble models can bring significant improvement for fault-cause identification.

Conclusions
Associated multisource data for transmission line fault-cause diagnosis are divided and extracted as waveform and contextual features in this paper. MVL is introduced to appropriately combine these features for performance improvement. A novel hierarchical multiview feature selection method based on an ε-dragging technique and sparsity regularization is proposed to perform hierarchical feature selection with multiview data. The ε-dragging is applied in the loss function to enlarge the sample distance between classes. The $\ell_{2,1}$-norm and F-norm conduct row-wise and view-level selection, respectively, which can be viewed as low-level and high-level feature selection. We also develop the optimization algorithm and prove its convergence theoretically. The proposed HMVFS is evaluated by comparisons on field data. The results reveal that HMVFS outperforms conventional feature selection methods in single-view and early fusion strategies. Further experiments concerning ML classifiers also demonstrate the superiority and effectiveness of the proposed method with high-generalization learners. This study has shown that the combined use of waveform and contextual features with HMVFS can yield significant improvement for fault-cause identification. In future work, more multiview data and further fault signature study are needed to refine the feature pools, and the performance of HMVFS is expected to be further improved.