To classify high-voltage equipment states, a large number of features obtained from various data sources can be used: dissolved gas analyzers, chromatography, thermal imaging, and various parameters of the individual elements and subsystems of the power equipment. In total, more than sixty different features are available. However, since the features refer to different diagnostic procedures, not all samples contain values for all features. In addition, a large number of features increases the risk of model overfitting. Therefore, the problem under consideration requires special attention at the data preprocessing stage; otherwise, it would not be possible to create adequate datasets for training, validation, and testing of the model.
2.3. Gaps and Outliers Processing
In general, missing values can be either restored or deleted. In the case of deletion, there are two options: remove the features (columns) containing gaps or remove the records (rows/samples) containing gaps. In the problem under consideration, neither strategy can be applied on its own. Restoration always introduces some extra distortion into the dataset: it partly transforms the data from real to simulated (synthetic), which reduces the data relevance and the reliability of the obtained results. Deletion, on the other hand, is likely to shrink the initial dataset several-fold and, in particular, to reduce the number of samples in one or more classes so much that there are not enough data to train the model adequately. Since data recovery cannot be eliminated completely, the share of synthetic data should be minimized. This study offers a mixed approach to balance the reduction in dataset size against the amount of synthetic data; after that, the restoration procedure is applied to the remaining gaps.
Analysis of the operation history of high-voltage equipment shows that, in the initial dataset, the number of diagnostic samples corresponding to transformers in good and satisfactory states will always be several times higher than for transformers in bad and, particularly, critical states. This follows from the existing requirements for the reliability and fail-safe operation of primary power system equipment. Therefore, taking into account the extremely low failure rates of power equipment, it is important to ensure the minimum required number of records for each class at all stages of model design. If this requirement is violated, the model will not be able to generalize the features of all classes. In the example given in Section 3, transformers are allocated among the following classes: “good”, “satisfactory”, “unsatisfactory”, “faulty”.
The IQR (interquartile range) is used to describe the scatter of the data:

IQR = Q3 − Q1,

where Q3 is the third quartile and Q1 is the first quartile of the feature. After that, the values below the lower outlier cutoff and above the upper outlier cutoff are excluded, with the number of interquartile ranges equal to k = 1.5 (below Q1 − 1.5·IQR and above Q3 + 1.5·IQR, respectively). It is important to note that other criteria can be used to remove outliers, for example, the 5th and 95th percentiles instead of Q1 and Q3. Of course, the removal of outliers should be done deliberately, taking into account the specifics of the task under consideration; otherwise, too many relevant data points may be removed.
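The rule above can be illustrated with a short sketch; it assumes the diagnostic features are stored in a pandas DataFrame, and the column name used in the example is hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)   # first quartile
    q3 = series.quantile(0.75)   # third quartile
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Usage on a hypothetical dissolved-gas feature column "H2_ppm":
# outliers = iqr_outlier_mask(df["H2_ppm"])
# df_clean = df[~outliers]
```

The same function can be reused with a larger k, or replaced by percentile-based cutoffs (e.g., the 5th and 95th percentiles) when the IQR rule removes too many relevant points.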
Gaps and outliers processing consists of six steps (Steps 3–8), as follows; a code sketch illustrating the main operations is given after the list.
Step 3. Removing columns with a large number of gaps.
3.1. Dg3 ← (D2 = NaN)—determining a binary matrix in which the elements indicate which values in D2 are gaps (NaN).
3.2. CGj ← Σi Dg3ij, i = 1, …, n2, j = 1, …, m2—counting gaps by columns.
3.3. Visualizing the number of gaps for each feature to decide on the threshold value Thcol.
3.4. u3 ← (CG > Thcol)—determining the numbers of columns that contain many missing values and should be removed.
3.5. D3 ← cutcol(D2, u3)—removing columns with a large number of gaps.
Step 4. Deleting rows with a large number of gaps.
4.1. Dg4 ← (D3 = NaN)—determining a binary matrix in which the elements indicate which values in D3 are gaps.
4.2. RGi ← Σj Dg4ij, i = 1, …, n3, j = 1, …, m3—counting gaps by rows.
4.3. Visualizing the number of gaps for each row (sample) to decide on the threshold value Throw.
4.4. u4 ← (RG > Throw)—determining the numbers of rows that contain many missing values and should be removed.
4.5. D4 ← cutrow(D3, u4)—removing rows with a large number of gaps.
Step 5. Merging classes.
5.1. Visualizing the distribution of the number of samples in the dataset by classes for subsequent decision making.
5.2. D5 ← j(D4, Thmerge)—merging nearby classes (states).
Step 6. Filling in the gaps.
Cycle 6.1. i = 1, …, n5.
A check is made: if the i-th row contains at least one missing value, then the missing value is replaced by the median value of this feature among the objects of the same type (transformer model) and the same class (transformer state):
If ∃j: D5ij = NaN, j = 1, …, m5, then
M = D5|(object_type(M) = object_type(D5i) AND class(M) = class(D5i)),
D6ij = median(M∙j).
Step 7. In some cases, the missing values cannot be filled in at Step 6 because the dataset does not contain the necessary data, i.e., there are no attribute values for a certain type of object in a certain state. Therefore, the rows with the remaining gaps need to be removed.
7.1. Dg7 ← (D6 = NaN)—determining a binary matrix in which the elements indicate which values in D6 are gaps (NaN).
7.2. RGi ← Σj Dg7ij, i = 1, …, n6, j = 1, …, m6—counting gaps by rows.
7.3. u7 ← (RG > 0)—determining the numbers of rows that still contain missing values and should be removed.
7.4. D7 ← cutrow(D6, u7)—removing rows.
Step 8. Removing outliers.
8.1. Visualizing the data distribution for each feature to decide on the boundary values blower, bupper for feature rejection, where blower and bupper are vectors of the boundary values for each feature.
8.2. u8 ← (D7 < blower OR D7 > bupper)—determining the numbers of rows that contain at least one outlier and should be removed.
8.3. D8 ← cutrow(D7, u8)—removing rows with detected outliers.
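The following sketch illustrates Steps 3–8 with pandas, assuming the dataset is a DataFrame whose rows are diagnostic records and whose columns include the features plus hypothetical "object_type" and "class" columns; the function name, the thresholds, and the boundary vectors are illustrative, and class merging (Step 5) is omitted for brevity.

```python
import pandas as pd

def preprocess_gaps(df: pd.DataFrame, th_col: int, th_row: int,
                    lower: pd.Series, upper: pd.Series) -> pd.DataFrame:
    # Step 3: drop columns (features) with more than th_col gaps
    col_gaps = df.isna().sum(axis=0)                 # CG: gaps per column
    df = df.loc[:, col_gaps <= th_col]

    # Step 4: drop rows (samples) with more than th_row gaps
    row_gaps = df.isna().sum(axis=1)                 # RG: gaps per row
    df = df.loc[row_gaps <= th_row].copy()

    # Step 6: fill remaining gaps with the median over the same
    # object type (transformer model) and the same class (state)
    feature_cols = df.columns.difference(["object_type", "class"])
    df[feature_cols] = (df.groupby(["object_type", "class"])[feature_cols]
                          .transform(lambda s: s.fillna(s.median())))

    # Step 7: drop rows whose gaps could not be filled
    df = df.dropna()

    # Step 8: drop rows containing values outside [lower, upper]
    inside = ((df[feature_cols] >= lower[feature_cols]) &
              (df[feature_cols] <= upper[feature_cols])).all(axis=1)
    return df[inside]
```

The threshold values correspond to Thcol and Throw chosen from the gap-count visualizations (Steps 3.3 and 4.3), and the boundary vectors correspond to blower and bupper from Step 8.1.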
2.4. Feature Transformation and Feature Importance Analysis
Monotonic feature transformation is critical for some algorithms and does not affect others; therefore, in this case it was necessary to analyze the feature distributions. A box-and-whisker diagram represents a one-dimensional probability distribution in a compact graphic form. The obtained plots can be used to estimate the asymmetry (skewness) of the distribution. A large proportion of machine learning algorithms assume that the data are normally distributed. In the case of a skewed distribution, it is recommended to apply a logarithmic transformation; otherwise, the predictive ability of the algorithm may deteriorate.
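As a brief illustration, the skewness can be estimated and a logarithmic transformation applied with pandas; the skewness threshold of 1.0 and the small offset are illustrative choices (the offset matches the 0.0001 used in Step 9.2 below and avoids taking the logarithm of zero).

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df: pd.DataFrame, threshold: float = 1.0) -> pd.DataFrame:
    """Apply log10(x + eps) to features whose absolute skewness exceeds the threshold."""
    out = df.copy()
    skewness = out.skew(numeric_only=True)
    for col in skewness.index[skewness.abs() > threshold]:
        out[col] = np.log10(out[col] + 1e-4)   # small offset avoids log10(0)
    return out
```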
An additional increase in dataset quality, in terms of the effectiveness of model training, can be achieved by analyzing the features’ collinearity and eliminating redundant features. The analysis is carried out using Spearman’s rank correlation coefficient. For two features (columns) a1 and a2 from the dataset, it is calculated as follows:

ρ = 1 − 6·Σ(ui − vi)²/(n·(n² − 1)),

where ui is the rank of the i-th element of the a1 series, vi is the rank of the i-th element of the a2 series, n is the number of values (the length of the columns a1, a2), and the sum is taken over i = 1, …, n.
If two features have the modulus of the correlation coefficient |ρ| close to 1, then one of the features should be excluded from the dataset.
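As a minimal sketch, Spearman's ρ for two feature columns can be computed with scipy, which performs the ranking internally; the data and the 0.95 threshold below are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

# Two hypothetical feature columns (in practice, columns of the dataset)
a1 = np.array([2.0, 4.1, 3.3, 8.7, 5.2, 6.0])
a2 = np.array([1.1, 2.0, 1.8, 4.5, 2.9, 3.1])

rho, p_value = spearmanr(a1, a2)
if abs(rho) > 0.95:   # threshold close to 1, chosen by the analyst
    print(f"|rho| = {abs(rho):.2f}: the features are nearly collinear, one can be dropped")
```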
It is possible to identify redundant or uninformative features using the following:
Collinearity (correlation) analysis of features based on the Spearman correlation coefficient matrix (cross-correlation of the features);
Analysis of Spearman’s correlation coefficients of the features in relation to the target variable (class);
Preliminary training of several machine learning models that evaluate the importance of the features during the solution process.
The features’ collinearity and importance analysis consists of the following steps; a code sketch is given after the list.
Step 9. Changing the features’ distribution.
9.1. The vector t of the numbers of the features to be transformed is determined.
9.2. D9ij = log10(D8ij + 0.0001)|j ∈ t, i = 1, …, n8, j = 1, …, m8.
Step 10. Assessing the features’ importance using correlation analysis and building a decision tree-based ensemble classification model.
10.1. C = corr(D9)—creating a correlation coefficient matrix of the features.
10.2. Selecting the features that can be excluded, taking into account the correlation coefficient values; their numbers form the vector u10.
10.3. Constructing the classifier φ(D9), which yields the feature importance vector v, and checking that the features from the vector u10 can be excluded without degrading the classification accuracy.
10.4. D10 ← cutcol(D9, u10)—removing the features selected as a result of the correlation analysis.
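A sketch of Step 10 in scikit-learn is given below; it assumes the transformed features D9 are held in a DataFrame X with the class labels in y, uses a gradient boosting classifier to stand in for the preliminary decision tree-based ensemble, and the 0.9 correlation threshold is illustrative.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def analyse_features(X: pd.DataFrame, y: pd.Series, corr_threshold: float = 0.9):
    # 10.1: Spearman correlation coefficient matrix of the features
    corr = X.corr(method="spearman")

    # 10.2: candidate features to drop - strongly correlated with an earlier feature
    to_drop = set()
    cols = list(corr.columns)
    for i, c1 in enumerate(cols):
        for c2 in cols[i + 1:]:
            if abs(corr.loc[c1, c2]) > corr_threshold:
                to_drop.add(c2)

    # 10.3: preliminary ensemble model estimating feature importance
    model = GradientBoostingClassifier().fit(X, y)
    importance = pd.Series(model.feature_importances_, index=X.columns)

    return sorted(to_drop), importance.sort_values(ascending=False)
```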
2.5. Second Iteration of the Algorithm
A peculiarity of the proposed approach is that the data preprocessing is iterative. After the collinearity analysis has been performed, the preliminary models have been built, and an estimate of the features’ importance has been obtained, a decision is made to exclude redundant or uninformative features. At the same time, at the stage of removing missing values, some samples (rows) could have been deleted because of gaps in exactly these features. Therefore, after excluding the redundant and uninformative features, the Throw threshold value should be revised, so that some previously deleted samples may be returned to the dataset. These samples are then used for restoring the remaining gaps, analyzing collinearity, and training the models. This technique can significantly increase the number of samples in the dataset, since samples whose gaps occur only in redundant or uninformative features will no longer be deleted.
At the 2nd iteration, Steps 1 and 2 are skipped, and the initial dataset taken for the 2nd iteration is not D, but D10.
The 1st iteration, preliminary data cleaning, is needed to determine the features that can be excluded at the very beginning of the 2nd iteration, i.e., the features that are not informative for this task. In this case, gaps and outliers in these features no longer affect the execution of the 2nd iteration, the main phase of data cleaning. This two-stage data processing thus solves two problems at once: uninformative features are identified and removed early, and samples whose only gaps were in those features are retained in the dataset.
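The two-iteration flow could be sketched as follows, reusing the hypothetical helper functions from the earlier sketches (preprocess_gaps and analyse_features); all thresholds, cutoffs, and variable names (df_raw, b_lower, b_upper) are illustrative assumptions.

```python
# Iteration 1: preliminary cleaning to identify uninformative features
d_clean = preprocess_gaps(df_raw, th_col=200, th_row=15, lower=b_lower, upper=b_upper)
X, y = d_clean.drop(columns=["object_type", "class"]), d_clean["class"]
_, importance = analyse_features(X, y)
uninformative = importance[importance < 0.005].index   # illustrative cutoff

# Iteration 2: drop the uninformative features first, then rerun the cleaning with a
# revised row threshold; rows whose only gaps were in the dropped features now survive
df_reduced = df_raw.drop(columns=uninformative)
d_final = preprocess_gaps(df_reduced, th_col=200, th_row=5, lower=b_lower, upper=b_upper)
```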
2.6. Machine Learning Models
A decision tree-based ensemble was used as the basic machine learning model. Decision trees are the most interpretable machine learning models, since they follow logical rules, can deal with both quantitative and categorical features, and do not require feature normalization. Decision-tree algorithms are deterministic and fast. However, the generalizing ability of a single decision tree is not sufficient for the problem under consideration; therefore, it is necessary to use ensembles of trees. An effective ensemble-building algorithm is boosting, i.e., sequentially creating models and adding them to the ensemble, where each new model seeks to reduce the current ensemble error. When using supervised learning on the dataset D = {(xi, yi): xi ∈ Rn, yi ∈ N}, the ensemble of k decision trees is formulated as follows:

yi = Σj=1,…,k wj·fj(Xi),

where yi is the output (prediction) of the model, Xi is the input of the model, fj(X) is an individual decision tree of the ensemble, wj is the weight of the tree, which sets its significance when combining the results of all decision trees, and k is the number of trees.
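The weighted combination above can be illustrated with a short sketch; it assumes a list of already fitted tree models with a scikit-learn-style predict method and shows a regression-style output, whereas boosting classifiers combine scores or votes analogously inside the library.

```python
import numpy as np

def ensemble_predict(trees, weights, X):
    """Weighted combination of individual tree outputs: y = sum_j w_j * f_j(X)."""
    return np.sum([w * tree.predict(X) for tree, w in zip(trees, weights)], axis=0)
```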
In the presented study, three boosting algorithms (AdaBoost, XGBoost, CatBoost) are considered. Other models and machine learning algorithms are also used for comparison of the results.
As with most heuristic methods, the hyperparameters of ensemble algorithms need to be tuned, the main ones being the tree depth and the number of trees. Setting the parameters manually is very laborious; therefore, the random search approach was used, which enumerates parameter values randomly.
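A random search over the main hyperparameters can be sketched with scikit-learn and XGBoost as follows; the parameter ranges, the number of sampled combinations, the scoring metric, and the training data names (X_train, y_train) are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "max_depth": randint(2, 10),        # tree depth
    "n_estimators": randint(50, 500),   # number of trees
}

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_distributions,
    n_iter=30,                          # number of random parameter combinations tried
    cv=5,
    scoring="f1_macro",
)
# search.fit(X_train, y_train)
# print(search.best_params_)
```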