Power Transformer Fault Detection: A Comparison of Standard Machine Learning and autoML Approaches

A key component for the performance, availability, and reliability of power grids is the power transformer. Although power transformers are very reliable assets, the early detection of incipient degradation mechanisms is very important for preventing failures that may shorten their residual life. In this work, a comparative analysis of standard machine learning (ML) algorithms (such as single and ensemble classification algorithms) and automatic machine learning (autoML) classifiers is presented for the fault diagnosis of power transformers. The goal of this research is to determine whether fully automated ML approaches are better or worse than traditional ML frameworks that require a human in the loop (such as a data scientist) to identify transformer faults from dissolved gas analysis results. The methodology uses a transformer fault database (TDB) gathered from specialized databases and technical literature. Fault data were processed using the Duval pentagon diagnosis approach and user–expert knowledge. Parameters from both single and ensemble classifiers were optimized through standard machine learning procedures. The results showed that the best-suited algorithm to tackle the problem is a robust, automatic machine learning classifier model, followed by standard algorithms, such as neural networks and stacking ensembles. These results highlight the ability of a robust, automatic machine learning model to handle unbalanced power transformer fault datasets with high accuracy, requiring minimum tuning effort by electrical experts. We also emphasize that identifying the most probable transformer fault condition will reduce the time required to find and solve a fault.


Introduction
Power transformers are key components for transmission and distribution grids. Although transformers are very reliable assets, the early detection of incipient degradation mechanisms is very important to prevent failures that may shorten their life span [1–3]. The life cycle management of power transformers is composed of several stages, such as transformer specification, erection, commissioning, operation, maintenance, and end-of-life operations. In particular, for the last two stages, it is of paramount importance to have suitable tools for assessing a power transformer's condition. Economic consequences of a power transformer's catastrophic failure include (i) costs for the lost transmission of electricity and (ii) repair or replacement costs for the faulted power transformer, which can vary according to the electrical system power level, substation topology, and technical characteristics of the transformer. For example, consider the case of lost transmission capability due to the failure of a single-phase transformer rated at 230 kV, 33 MVA, located somewhere in Mexico. The economic impact is composed of (i) the costs of transmission loss, which rise up to USD 6,177,600 (since the cost for loss of transmission is around 2.6 USD/kWh in Mexico; this figure is consistent with 33 MW lost over a 72 h window), and (ii) the direct costs, including a 72 h affectation window (firefighting, repair of damaged facilities, soil remediation operations, reserve transformer testing and commissioning, and restoring the substation and system to their prior conditions), with a direct cost of around USD 1,280,000. Therefore, grid operators and utilities require tools that allow them to optimize their decision-making processes regarding transformer repair, refurbishment, or replacement under the umbrella of costing, reliability, and safety optimization [4–6].
Condition assessment (CA) is the process of identifying markers and indexes to determine and quantify the degradation level of transformer components [1,7,8]. Power transformer CA strategies include exhaustive electrical and physicochemical testing, online and/or offline diagnosis techniques, analysis of the operation and maintenance parameters, and the use of condition-based strategies supported by suitable standards and expert knowledge. In fact, an expert assessment is the most effective CA strategy, but it is also costly and time-consuming. It requires taking transformers offline and the participation of highly qualified experts to carry out the analysis, increasing the process costs. Thus, utilities are looking for more cost-effective CA strategies where few or zero expert interventions are required.
One of the main steps of a transformer CA is the identification of faults through a transformer fault diagnosis (TFD) procedure. The TFD focuses on the transformer insulation system, whose integrity is directly related to transformer reliability [9,10]. The insulating system is exposed to electrical, mechanical, and thermal stresses. These stresses are considered normal if they were accounted for in the transformer's design; otherwise, they are considered abnormal. Among the abnormal behaviors are emergency overloading, arc flashes, transient events, and thermal faults, to mention a few [11,12]. The transformer insulation system has two main components: the insulating fluid (commonly mineral oil) and the solid insulation (kraft paper, pressboard, and other materials). Oil plays a very important role in providing highly reliable insulation characteristics and working as an efficient coolant, removing the heat generated at the core and windings during transformer operation [13]. Further, insulating oil analysis can provide important information regarding transformer degradation and behavior at a very low cost, eliminating the need to carry out expensive offline testing.
Transformer insulating oil is a petroleum-derived liquid that can be based on paraffinic, naphthenic, naphthenic-aromatic, or aromatic hydrocarbons. No matter its structure, insulating oil can be decomposed by abnormal stresses (discharges and heating), producing dissolved byproduct gases correlated to specific faults. Hence, dissolved gas analysis (DGA) is a widely studied diagnostic technique for which many tools are already available. These tools are based on the analysis of each byproduct gas, its concentration, and the interrelationship between them. Among the most classical methods to diagnose oil samples are the Rogers ratio, IEC ratio, Dornenburg ratio, key gas method, Duval triangles [2,12–16], and Duval pentagons [4,17], to mention a few. Most of those methods are based on dissolved gas ratio intervals that classify transformers into different faults. However, these methods are prone to misinterpretations near the fault boundaries [15,16]. Furthermore, classical DGA methods always identify a fault, even when there is not one; thus, expert assessment is still required to accurately determine if there is a fault or not. On the other hand, coarse DGA-based fault classification methods have a high accuracy rate but poor usability, whereas fine TFD can be used for decision-making, but its accuracy rate is lower [15]. In general, to decide whether to remove, repair, or replace a transformer in the presence of thermal faults, the fault severity must be determined [17]; thus, finer TFD is preferred.
An important avenue for TFD methods is machine learning (ML). Data-based algorithms have been proposed to improve TFD performance while avoiding the drawbacks mentioned earlier. ML methods provide high flexibility: they are able to handle linear and nonlinear relations, are robust to noise, do not necessarily require taking into account thermodynamic phenomena, and provide high fault diagnosis performance [18]. ML algorithms that have been used for the TFD endeavor can be divided into supervised and unsupervised approaches. Supervised ML employs different gas-in-oil ratios, already diagnosed by experts using chromatographic results, to build a function that relates those gas ratios to transformer faults or to a normal or faulty status. Unsupervised approaches employ dissolved gas data to cluster the transformers into groups whose gas ratios are similar to each other. Nevertheless, an expert's diagnosis is always required to assess the performance of the models; thus, this study focuses on the supervised approach. Most of the ML works applied to the TFD problem cover one or more of the following steps:
Model overfitting assessment: The problem of overfitting the TFD classifiers has been handled through the usage of classical [1,13,20] and stratified cross-validation (s-CV) [24].
Even though many works have delved into the usage of ML algorithms for the TFD problem, they present one or more shortfalls, such as (i) training and testing their methods using small datasets; (ii) carrying out comparisons using only standard ML supervised algorithms; (iii) considering only coarse fault types by setting aside fault severity (not to mention that none of the reviewed works considered fault severity as defined by the Duval pentagon method); and (iv) a lack of publicly available data. These issues stand in the way of obtaining a clear idea of which sequence of methods and algorithms provides the best performance for the TFD problem. They also make the reproducibility of the research results difficult and hinder the deployment of ML solutions to solve the TFD problem of real-world utilities.
The construction of high-performance ML pipelines, regardless of their application, requires the involvement of data scientists and domain experts. This allows us to incorporate domain knowledge into the design of specialized ML pipelines (i.e., the sequence of data pre-processing, domain-driven feature selection and engineering, and optimized ML models for a given problem [26]). However, the construction of specialized ML pipelines using this approach is long-winded, expensive, complex, iterative, and based on trial and error. This analysis (and the related works) reveals how difficult it is for operational process experts to build intelligent models. These power systems experts can be easily overwhelmed by the selection and combination of the ever-growing alternatives of pre-processing methods, ML algorithms, and their parameter optimization for the solution of the TFD problem. Under these circumstances, the probability of obtaining a final ML pipeline that behaves sub-optimally is higher [26,27]. Hence, there is a growing need to provide power systems technicians with ML tools that can be used straightforwardly to solve power systems problems (e.g., TFD). The approaches used for automatically (without human intervention) and simultaneously obtaining a high-performing combination of data pre-processing, learning algorithm(s), and a set of hyperparameters are termed automatic machine learning (autoML) [27,28]. autoML comprises promising approaches that may be used off the shelf for solving the TFD problem of real-world industries.
Therefore, in this work, we present a deep comparative analysis of a large pool of supervised ML algorithms composed of single, ensemble, and autoML classifiers applied to the TFD problem. The purpose of this review is to compare algorithms' performance for the TFD problem under equal experimental settings, by (i) compiling and sharing a transformer fault database (TDB) of the main dissolved gas data of 821 transformers and their corresponding diagnostics, (ii) using single and ensemble ML algorithms, as well as state-of-the-art autoML frameworks, to solve the TFD problem, and (iii) solving a real-world TFD multi-classification problem using, for the first time (to the best of the authors' knowledge), Duval pentagons' fault and severity classes [29]. In doing so, this analysis improves our comprehension of the ML approaches available for the TFD problem, and it gives a view of how much automation we can expect for the TFD problem, particularly when fault severity is taken into consideration.
This work is organized as follows: The introduction is presented in the first section. The second section presents a detailed definition of the materials and methods used for the comparative analysis of standard ML and autoML algorithms. The third section outlines the results and discusses the outcomes obtained in the fault diagnosis of power transformers. The conclusion of this work is given in the fourth section.

Materials and Methods
The complete ML pipeline applied in the present work for the multi-class TFD problem is presented in Figure 1. For comparison, we termed the part of the pipeline corresponding to single and ensemble classifiers as the standard ML framework, and the part of the pipeline corresponding to autoML as the autoML framework. Furthermore, we specify a shared pipeline for both ML approaches. The overall ML system consists of five major sections:
1. Data collection and labeling. In this step, we collected transformers' dissolved gas-in-oil data and their corresponding diagnostics. We double-checked transformers' diagnostics: first using the Duval pentagons method to obtain the fault severity (if not available), and then using the IEEE C57.104-2019 standard and expert validation to identify normally operating transformers.
2. Initial pre-processing. In this step, we pre-processed gas-in-oil information using several methods found in the literature, namely, the replacement of zero measurements, natural logarithm scaling, and derivation of key gas ratios.
3. Data separation into training (i.e., Xtrain and Ytrain) and testing (i.e., Xtest and Ytest) datasets. For this splitting, we considered the number of samples in each class, to avoid leaving classes unrepresented in any of the datasets.
4. Training the ML system:
a. Standard ML framework. In this step, we carried out a second data pre-processing stage, training, and parameter optimization. We optimized the parameters from single and ensemble classifiers using grid search (GS) and cross-validation (CV) procedures.
b. AutoML framework. In this step, the code automatically carried out a warm-start procedure, additional data and feature pre-processing methods, classifier optimization, and ensemble construction.
5. Measuring the test error using several multi-class performance measures. In this step, we evaluated the algorithms comprehensively using several multi-class performance measures such as the κ score, balanced accuracy, and the micro and macro F1-measure.

DGA Data
In this work, we constructed a transformer fault database (TDB) comprising 821 samples using different bibliographic sources. These samples were obtained from a specialized database and technical literature: from the International Council on Large Electric Systems (CIGRE), from IEEE [16], technical papers [17,30–34], a CIGRE technical brochure [35], and expert curation. For each transformer, we collected its five thermal hydrocarbon gases and, when reported, their corresponding diagnostics. The collected gases were hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4), and acetylene (C2H2). When available, we recovered the associated diagnostics from the bibliographic sources. Otherwise, we obtained those by means of an analysis method. In this paper, we selected the Duval pentagons as our analysis method [17,29] since it offers not only fault types but also the severity for thermal faults. It is important to note that, in some cases, this analysis method was also used to confirm the literature-provided diagnostics.
According to [29], the Duval pentagons method first calculates the relative percentage ratios by dividing the concentration of each gas by the total gas content (TGC). Then, the five relative percentage gas ratios are plotted on their corresponding axes in the Duval pentagon, yielding a non-regular five-sided polygon. The centroid of the irregular polygon provides the first part of the diagnostic by indicating the region of the pentagon where it is located. The diagnostic faults available (regions) in the first Duval pentagon are partial discharges (PDs), low and high energy discharges (D1 and D2, respectively), thermal faults involving temperatures less than 300 °C (T1), thermal faults with temperatures ranging from 300 to 700 °C (T2), and thermal faults involving temperatures higher than 700 °C (T3). There is an additional region in the first pentagon called the stray gassing region (S), which reveals another type of gas generation mechanism. Stray gassing is associated with relatively low temperatures, oxygen presence, and the chemical instability of oil molecules caused by a previous hydrogen treatment, whose purpose is the removal of impurities and undesirable chemical structures in mineral oils. The second part of the Duval pentagons method allows the user to refine the diagnostics by providing advanced thermal diagnostic options: high-temperature thermal faults that occur in oil only (T3-H), different temperature thermal faults involving paper carbonization (T1-C, T2-C, and T3-C), and overheating (T1-O).
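The first part of the procedure described above (relative percentages, placement on five axes, and centroid of the resulting polygon) can be sketched as follows. This is only an illustration under stated assumptions: the axis order (H2 at the top, then C2H6, CH4, C2H4, and C2H2 counterclockwise) follows the usual presentation of the Duval pentagon and should be checked against [29]; the function names are hypothetical; and the zone boundaries that map the centroid to a fault class are deliberately omitted because they are not given in this text.

```python
import math

# Assumed axis order, counterclockwise starting from the top axis.
GAS_ORDER = ("H2", "C2H6", "CH4", "C2H4", "C2H2")


def relative_percentages(ppm):
    """Return each gas as a percentage of the total of the five gases."""
    total = sum(ppm[g] for g in GAS_ORDER)
    return {g: 100.0 * ppm[g] / total for g in GAS_ORDER}


def pentagon_centroid(ppm):
    """Centroid (x, y) of the polygon spanned by the five relative percentages.

    Assumes the polygon has non-zero area, which holds once zero readings
    have been replaced by small positive constants.
    """
    pct = relative_percentages(ppm)
    # Place each percentage on its axis: the first axis points up (90 degrees)
    # and the others follow every 72 degrees counterclockwise.
    points = []
    for k, gas in enumerate(GAS_ORDER):
        angle = math.radians(90.0 + 72.0 * k)
        points.append((pct[gas] * math.cos(angle), pct[gas] * math.sin(angle)))
    # Shoelace formulas for twice the signed area and for the centroid.
    twice_area = 0.0
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)
```

For a sample dominated by hydrogen, the centroid moves toward the H2 axis; locating it inside one of the pentagon regions would then yield the diagnostic.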
However, all the available classical TFD methods (including the Duval pentagon method) always provide a diagnostic, even when gas concentrations may be too low to indicate a fault. To avoid false positives, we used the IEEE C57.104-2019 standard [36] along with expert experience to tag the corresponding transformers with a normal condition diagnostic in these cases. The resulting class distribution for the TDB is shown in Table 1.

Initial Pre-Processing of DGA Data
Before any TFD can be carried out, either with the standard ML or the autoML framework, the TDB must be initially pre-processed. This pre-processing stage consists of three steps: (i) the replacement of zero measurements, (ii) the scaling of measurement values using the natural logarithm (ln) function, and (iii) the derivation of features from dissolved gas ratios. The reasons for carrying out an initial data pre-processing stage are twofold. On one hand, data pre-processing methods improve the performance of standard ML frameworks for the TFD problem [1,2,5,13,15,22,24]. On the other hand, autoML frameworks perform better at ML model selection and hyper-parameter optimization (HPO) than at feature engineering (i.e., creation) and data pre-processing [26,37]. Furthermore, the autoML algorithm selected in this work does not include the pre-processing methods used in the proposed pipeline, nor a feature engineering method that can derive dissolved gas ratios from TDB sample measurements.
The initial pre-processing of DGA data was as follows: First, we treated gas measurements reported as zero as being below the limit of detection of the chemical analysis procedure. Thus, for the zero measurements, we assumed a small constant value for mathematical convenience (i.e., 1), or an even smaller constant for C2H2 (i.e., 0.1). Second, we scaled gas values using the natural logarithm function. This process is widely suggested for scaling features with positively skewed distributions (i.e., heavy tails), since it improves their normality and reduces their variance [38]. Third, we conducted a feature engineering process consisting of the estimation of different ratios from transformed gas values. The relationship between fault types and proportions of dissolved gases in the insulating system has been exploited in traditional DGA methods [9,24,30]. Therefore, several relative ratios based on CH4, C2H6, C2H4, C2H2, and H2 were derived. We used these relative ratios as derived features, which are shown in Table 2. In this table, THC (total hydrocarbon content) is the sum of hydrocarbon gas contents, whereas TGC (total gas content) is the total amount of dissolved gas content in the transformer oil.
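A minimal sketch of the three pre-processing steps is given below. The zero substitutes (1, and 0.1 for C2H2) are those stated above; everything else is an assumption made for illustration. In particular, the ratios shown are a hypothetical selection of classical DGA ratios rather than the contents of Table 2, they are computed here on the zero-filled concentrations, TGC is assumed to be the sum of the five gases, and the function name is invented.

```python
import math

GASES = ("H2", "CH4", "C2H6", "C2H4", "C2H2")
# Substitutes for zero readings: 1 for every gas, 0.1 for C2H2.
ZERO_SUBSTITUTE = {"H2": 1.0, "CH4": 1.0, "C2H6": 1.0, "C2H4": 1.0, "C2H2": 0.1}


def preprocess_sample(ppm):
    """Return log-scaled gases plus a few illustrative ratio features."""
    # Step (i): replace zero measurements by a small constant.
    filled = {g: ppm[g] if ppm[g] > 0 else ZERO_SUBSTITUTE[g] for g in GASES}
    # Step (ii): natural-logarithm scaling of each gas.
    features = {f"ln_{g}": math.log(filled[g]) for g in GASES}
    # Step (iii): ratio features (hypothetical selection, not Table 2).
    thc = filled["CH4"] + filled["C2H6"] + filled["C2H4"] + filled["C2H2"]
    tgc = thc + filled["H2"]  # assumption: TGC is the sum of the five gases
    features["CH4/H2"] = filled["CH4"] / filled["H2"]
    features["C2H2/C2H4"] = filled["C2H2"] / filled["C2H4"]
    features["C2H4/C2H6"] = filled["C2H4"] / filled["C2H6"]
    features["THC/TGC"] = thc / tgc
    return features
```

Replacing zeros before taking logarithms or ratios keeps every feature finite, which is the mathematical convenience mentioned above.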

Splitting Data and Training the ML System
Once data were initially prepared, we split them into training and testing datasets. For this split, we considered the proportion of the classes, so each fault type was represented in both datasets, training (Xtrain, Ytrain) and testing (Xtest, Ytest). The proportions used for splitting the TDB were 70% for training and 30% for testing. Both subsets kept the same class distribution ratios as in the full TDB, to assess classifiers' performance with imbalanced datasets. Afterwards, the ML systems were trained. Before delving into the details of both ML frameworks (standard and autoML), it is worth mentioning that a second stage of data pre-processing was kept separate so that the same data pre-processing method (i.e., standardization) is not carried out twice in the autoML approaches.
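The class-aware 70/30 split can be illustrated with a dependency-free sketch. The function name, the seed, and the rule that keeps at least one sample of each class on both sides are assumptions made for this example; in practice, scikit-learn's `train_test_split` with its `stratify` argument serves the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Take about test_fraction of the class for testing, but keep at
        # least one sample for testing and one for training when possible.
        n_test = max(1, round(test_fraction * len(members)))
        n_test = min(n_test, len(members) - 1)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)
```

Splitting class by class is what keeps rare fault types present in both subsets, which a purely random split cannot guarantee.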

Standard ML Framework
The standard ML framework follows a classical pipeline: (i) data pre-processing, (ii) selection of the classifier (either single or ensemble), and (iii) optimization of classifier parameters (using a GS-CV procedure). To complete the data pre-processing treatment, we standardized TDB gas measures by subtracting the mean and scaling the values to unit variance. Next, we selected a classification algorithm, either a single (ANN, DT, Gaussian processes (GPs), naive Bayes (NB), KNN, LR, and SVM) or an ensemble algorithm. The main difference between single and ensemble classifiers is that the former produce one robust model with good generalization, whereas the latter employ several instances of the same classifier. Usually, the classifiers composing the ensemble perform slightly better than a random classifier (e.g., by overfitting), and by using different combining strategies, a good generalization is attained. Among the ensemble strategies are boosting (histogram (HGB) and extreme (XGBoost) gradient boosting), bagging (bagging classifier (BC) and random forest (RF)), and stacking (SE). The stacked ensemble is a particular case where two or more strong classifiers are sequentially chained. For this study, an ANN followed by an SVM was employed.
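The standardization step mentioned above can be sketched for a single feature column as follows. The helper names are hypothetical, and the use of the population standard deviation is an assumption chosen to mirror the common convention of scaling to unit variance; the statistics are fitted on the training data only and then reused for the test data.

```python
import statistics


def fit_standardizer(train_column):
    """Return the mean and standard deviation learned from training values."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    # Guard against constant columns, which would otherwise divide by zero.
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Subtract the mean and divide by the standard deviation."""
    return [(value - mean) / std for value in column]
```

Fitting on the training split alone avoids leaking information from the test samples into the model.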
Single and ensemble classifiers have been thoroughly discussed elsewhere; however, for the sake of completeness, they are briefly detailed in Appendices A.1 and A.2, respectively. Meanwhile, in Table 3, the parameters employed by single and ensemble classifiers are presented. The optimal values were estimated using a grid search cross-validation procedure with k = 5 folds.
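The grid search with 5-fold cross-validation can be illustrated with the following dependency-free sketch. It is not the code used in this work: a toy nearest-neighbour vote stands in for any tunable classifier, folds are formed by interleaving sample indices, and plain accuracy is used as the score for brevity. All function names are hypothetical; scikit-learn's `GridSearchCV` offers the same procedure in practice.

```python
import itertools
import math
from collections import Counter


def knn_predict(train_X, train_y, x, n_neighbors):
    """Toy k-nearest-neighbours vote, standing in for any tunable classifier."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in nearest[:n_neighbors])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, params, n_folds=5):
    """Mean held-out accuracy over n_folds interleaved folds."""
    scores = []
    for fold in range(n_folds):
        test = [i for i in range(len(X)) if i % n_folds == fold]
        train = [i for i in range(len(X)) if i % n_folds != fold]
        tr_X = [X[i] for i in train]
        tr_y = [y[i] for i in train]
        hits = sum(knn_predict(tr_X, tr_y, X[i], **params) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds


def grid_search(X, y, grid, n_folds=5):
    """Evaluate every parameter combination and keep the best CV score."""
    names = sorted(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, params, n_folds)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Because every candidate is scored only on samples it was not trained on, the selected parameters are less likely to reflect overfitting to a single split.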

AutoML Framework
AutoML tools are frameworks whose main purpose is to make ML available for people who are not ML experts. Among these tools, we selected the auto-Sklearn algorithm, which is one of the first autoML frameworks and provides robust and expert-competitive results in several ML tasks [26,28,39]. auto-Sklearn is an algorithm based on the Python scikit-learn (Sklearn) library [40]. It is employed for building classification and regression pipelines by searching over single and ensemble ML models. This algorithm explores semi-fixed structured pipelines by setting an initial fixed set of data cleaning steps. Then, a sequential model-based algorithm configuration (SMAC) using Bayesian optimization in combination with a random forest regression allows the selection and tuning of optional pre-processing and mandatory modeling algorithms. In addition, auto-Sklearn provides parallelization features, meta-learning to initialize the optimization procedure, and ensemble learning through the combination of the best pipelines [26,28,39].
To improve the analysis between standard ML and autoML frameworks, two autoML versions are considered, namely, vanilla auto-Sklearn and robust auto-Sklearn models. The main differences between them are that (i) the vanilla model only considers a single model whereas the robust model employs an ensemble, and (ii) the vanilla model does not employ the meta-learning warm-start stage to initialize the optimization procedure, whereas the robust model does. In this sense, the vanilla model serves as a baseline for the autoML framework. autoML classifiers have been discussed elsewhere; however, for the sake of completeness, they are detailed in Appendix B.

Classification Performance Metrics
To compare the performance of the standard ML and the autoML frameworks, we employed several multi-classification metrics. As mentioned before, several classification metrics have been employed for the analysis of algorithms' performance for the TFD problem (i.e., the accuracy percentage, confusion matrix, the area under the receiver operating characteristic (AUCROC) and precision-recall (AUCPR) curves, and the micro and macro F1-measure). However, neither the accuracy percentage nor the AUCROC accounts for class imbalance. Further, neither the AUCROC nor the AUCPR is suitable for analyzing a multi-classification problem. Therefore, in this work, we employed the confusion matrix (CM), the balanced accuracy (BA), the F1-measure (F1) using micro and macro averages, Cohen's kappa (κ) metric, and Matthews' correlation coefficient (MCC).
On one hand, the CM is a tool to understand the errors of classifiers in binary, multi-class, and even multi-label scenarios. On the other hand, the remaining performance metrics used in this work are obtained from the CM. The selected metrics are useful for assessing the overall performance of a classifier in a multi-class problem. Of these, MCC and κ (and, to a lesser extent, F1-macro) are more robust than the others for assessing the expected performance of classifiers in the presence of class imbalance.
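Since all the selected metrics derive from the CM, they can be computed from it directly. The sketch below uses the standard multi-class definitions (BA as the mean per-class recall, κ from observed versus chance agreement, and the multi-class generalization of MCC). The function name and the row/column convention (rows are true classes, columns are predicted classes) are assumptions made for this example.

```python
import math


def multiclass_metrics(cm):
    """Metrics from a square confusion matrix indexed as cm[true][predicted]."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    true_tot = [sum(row) for row in cm]  # samples per true class
    pred_tot = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]
    correct = sum(cm[k][k] for k in range(n_classes))

    recalls, f1s = [], []
    for k in range(n_classes):
        recall = cm[k][k] / true_tot[k] if true_tot[k] else 0.0
        precision = cm[k][k] / pred_tot[k] if pred_tot[k] else 0.0
        denom = precision + recall
        recalls.append(recall)
        f1s.append(2 * precision * recall / denom if denom else 0.0)

    accuracy = correct / total
    # Agreement expected by chance, from the marginal class frequencies.
    chance = sum(t * p for t, p in zip(true_tot, pred_tot)) / total ** 2
    kappa = (accuracy - chance) / (1 - chance) if chance < 1 else 0.0

    # Multi-class Matthews correlation coefficient.
    cov_tp = correct * total - sum(t * p for t, p in zip(true_tot, pred_tot))
    cov_tt = total ** 2 - sum(t * t for t in true_tot)
    cov_pp = total ** 2 - sum(p * p for p in pred_tot)
    mcc = cov_tp / math.sqrt(cov_tt * cov_pp) if cov_tt and cov_pp else 0.0

    return {
        "BA": sum(recalls) / n_classes,
        "F1-micro": accuracy,  # equals accuracy for single-label multi-class data
        "F1-macro": sum(f1s) / n_classes,
        "kappa": kappa,
        "MCC": mcc,
    }
```

BA and F1-macro weight every class equally, while κ and MCC discount agreement that could arise from the class frequencies alone, which is why they are more informative on an imbalanced database.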

Software
We conducted all the experimentation required for the comparison of TFD ML algorithms, i.e., pre-processing, training, and testing, using the Python programming language in a Jupyter notebook. We used standard Python packages, such as numpy [41] and pandas [42], for the initial pre-processing stages. For training the classical and most of the ensemble ML algorithms, we employed the sklearn [40] package (in the case of XGBoost, the xgboost [43] package was used). For the autoML case, we used the autosklearn package [28]. The computer notebook is available in a GitHub repository.
It is worth noting that, while it would be a good idea to use the MCC and κ as a cost function for training the algorithms, due to sklearn package limitations, the algorithms' training cost function is restricted to the F1-macro.

Results
This section presents the TFD classification results obtained for algorithms of the standard ML and the autoML frameworks. For each classifier, we calculated five (5) performance metrics (as described in the above section). Using those metrics, we carried out a quantitative comparative analysis to determine the best algorithm(s). For a deeper analysis of the performance of the rest of the algorithms, we carried out a multi-objective decision-making (MODM) comparison. Afterward, through the CM, we analyzed class imbalance, false positives, and false negatives of the best-performing algorithm.

Overall Classifier Performance for the TFD Problem
In Table 4, we present the results of the standard ML and the autoML frameworks for the five quality metrics. We highlight the best-performing solutions in bold. It can be observed that, in general, the best-performing algorithm is the robust auto-Sklearn model for the five quality metrics. This model outperformed the rest of the algorithms, particularly for the F1-macro measure, where the closest competitors (ANN and SE models) attained approximately 10% lower F1-macro scores. These results show the ability of the robust auto-Sklearn model to handle an imbalanced TDB, providing the highest classification performance among all the tested algorithms while requiring minimal tuning effort from the humans in the loop (i.e., electrical experts carrying out a TFD). Therefore, the robust auto-Sklearn model seems preferable as an off-the-shelf solution for the TFD problem.

Analysis of the Frameworks' Performance
The above results show that the robust auto-Sklearn model (autoML algorithm) is the best-performing algorithm in the TFD problem using the TDB. However, it is not clear how much worse the performance levels of the remaining algorithms were in comparison. Also, there might be cases where using the robust auto-Sklearn model is not possible due to issues related to model explainability, training computational cost, productizing models, or other business-related issues raised by utility stakeholders. In such scenarios, it would be useful to determine if the vanilla auto-Sklearn model (or another single autoML framework such as auto-WEKA [36]) is better or worse than single/ensemble standard classifiers. When considering the results for the ANN and SE algorithms, we found that these were better for the five metrics in comparison to the vanilla auto-Sklearn model. Similarly, other single and ensemble algorithms (such as SVM and HGB) performed better than the vanilla model for F1-micro, κ, and MCC. To improve the performance comparison, metric results for each algorithm were transformed using the vanilla auto-Sklearn result as a baseline, as follows:
$$N_i(A) = 1 - \frac{M_i(\text{auto-Sklearn vanilla})}{M_i(A)}$$
where $M_i(A)$ corresponds to the $i$ metric result for algorithm $A$, $M_i(\text{auto-Sklearn vanilla})$ corresponds to the $i$ metric result for the vanilla auto-Sklearn model, and $N_i(A)$ corresponds to the baseline transformed value for the $i$ metric and algorithm $A$. For instance, for BA and ANN, the baseline transformed value $N_{BA}(\text{ANN})$ is obtained as $1 - BA(\text{auto-Sklearn vanilla})/BA(\text{ANN})$. The transformed values can be interpreted as follows: an $N_i(A) > 0$ value implies that the performance of the $A$ algorithm is better than that of the vanilla auto-Sklearn algorithm. In contrast, if $N_i(A) < 0$, then the $A$ algorithm's performance is worse than that of the vanilla auto-Sklearn algorithm.
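The baseline transformation amounts to one line of arithmetic per metric. In the sketch below, the function name is hypothetical and the metric values used in the usage note are invented for illustration; they are not results from Table 4.

```python
def baseline_transform(metric_value, baseline_value):
    """Return 1 - baseline/metric: positive when the algorithm beats the baseline."""
    return 1.0 - baseline_value / metric_value


def transform_all(metrics, baseline_metrics):
    """Apply the transformation to every metric shared with the baseline."""
    return {
        name: baseline_transform(value, baseline_metrics[name])
        for name, value in metrics.items()
    }
```

For example, with an invented baseline BA of 0.80, an algorithm reaching 0.90 would obtain a transformed value of about 0.11, whereas one reaching 0.70 would obtain about −0.14; the baseline itself always maps to 0.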
Once metric values were transformed, we carried out an MODM comparison. MODM deals with problems where two or more performance criteria are used together to make a decision: in our case, we were looking for an algorithm capable of identifying specific electric transformer faults, as accurately as possible, in terms of five performance metrics. In an MODM, model quality is defined by an n-dimensional vector where n corresponds to the number of metrics used. Hence, an algorithm solving an MODM must consider either a way to simplify a vector of quality metrics into a single scalar, or a way to handle multiple objective functions all at once.
Regarding the methods that solve multiple objective functions, they all use the Pareto approach (PA) [44]. In the PA, instead of handling the burden of collapsing multiple metrics into a single value, one looks for a set of solutions (e.g., TFD classification algorithms) that are non-dominated. To define this concept, it is easier first to define the opposite, i.e., dominance. A solution $s_i$ is said to dominate a solution $s_j$ if $s_i$ is strictly better than $s_j$ in at least one of the quality metrics $c_k$, $k = 1, \ldots, n$, and equal or better in the remaining metrics. Formally, this comprises (i) $\exists\, c_k : c_k(s_i) > c_k(s_j)$ and (ii) $\forall\, c_k : c_k(s_i) \geq c_k(s_j)$ (where $c_k(s_i)$ stands for the value of quality metric $c_k$ for solution $s_i$) [44]. On the other hand, two solutions $s_i$ and $s_j$ are said to be non-dominating with respect to each other if (i) quality metric values for solution $s_i$ are strictly better than those of $s_j$ in at least one of the $c_k$, $k = 1, \ldots, n$, and (ii) quality metric values for solution $s_i$ are strictly worse than those of $s_j$ in at least one of the quality metrics $c_k$, $k = 1, \ldots, n$. The set of non-dominated solutions is also known as the Pareto frontier.
In Figure 2, the Pareto analysis carried out on the vanilla transformed quality metrics, excluding the robust auto-Sklearn model, is shown. Observe that the vanilla auto-Sklearn model is shown at the origin (0,0); algorithms in the Pareto frontier are depicted in red, whereas the worst-performing algorithms are displayed in blue. From this figure, note that the SE, ANN, and GP algorithms performed better than the vanilla auto-Sklearn (for BA, the improvements were 3%, 3%, and −10%, respectively, whereas for κ, the improvements were 3%, 2%, and 4.5%, respectively). Hence, and without considering the robust auto-Sklearn algorithm, any of these can be selected for the TFD problem. On the other hand, HGB and SVM, while they performed better for the κ metric than the vanilla auto-Sklearn (3% and 2%, respectively), could be considered as good as the vanilla auto-Sklearn model in a Pareto front sense (and, to a lesser extent, so could RF). The remaining algorithms should be considered to have performed worse than the vanilla auto-Sklearn model. Specifically, the LR and NB algorithms performed considerably worse than the vanilla model: for the BA metric, 17% and 14% worse, and for the κ metric, 5% and 14%, respectively. In summary, single autoML frameworks provide a good identification of transformer faults with minimal human intervention; still, standard ML approaches such as ANN, SE, or GP classifiers would provide better results for the TFD problem.
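The dominance test and the extraction of the Pareto frontier can be sketched as follows, assuming that larger metric values are better. The function names are hypothetical, and the scores in the usage note are invented two-metric vectors, not values read from Figure 2.

```python
def dominates(a, b):
    """True if a is at least as good as b in every metric and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Names of the solutions that no other solution dominates."""
    front = []
    for name, vector in scores.items():
        dominated = any(
            dominates(other, vector)
            for other_name, other in scores.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front
```

With the invented scores `{"A": (0.03, 0.03), "B": (-0.10, 0.045), "C": (0.0, 0.0), "D": (-0.17, -0.05)}`, solutions A and B form the frontier: A dominates C and D, while A and B each beat the other in one metric and are therefore non-dominating with respect to each other.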

Transformers' Fault Diagnosis in Detail
In accordance with the above results, the overall best-performing algorithm for the TFD problem is the robust auto-Sklearn (autoML) algorithm. But how did it perform for each transformer fault type? And how did its performance compare against one of the algorithms belonging to the Pareto frontier, such as the SE algorithm? Figure 3 presents the confusion matrix for both algorithms: Figure 3a shows the robust auto-Sklearn, whereas Figure 3b shows the SE. It can be observed that, in general, both algorithms identified most fault types with a good (≥80%) to very good (≥90%) accuracy, except for the following: in (a), PD and S, with an accuracy of 71% and 78%, respectively; in (b), S, T2-C, and T3-C, with an accuracy of 78%, 71%, and 75%, respectively. To examine the weaker performance on these fault types, it is useful to recall that, when analyzing the performance of an algorithm using the multi-class CM (see Appendix C), the off-diagonal entries of a row are the false negatives (FNs) of that row's class and the off-diagonal entries of a column are the false positives (FPs) of that column's class. Thus, for the robust auto-Sklearn algorithm, PD faults were misclassified 29% of the time as S faults, while S faults were misclassified 19% of the time as T1-O faults and 3.7% of the time as a normal condition. For the SE algorithm, S faults were misclassified 22% of the time as T1-O faults; T2-C faults were misclassified 14% of the time as T1-O and T1-C faults; and T3-C faults were misclassified 12% of the time as T3-H and S faults. Of all these errors, the robust auto-Sklearn algorithm incurs the most expensive ones (i.e., classifying a fault as a normal condition). Further, the misclassifications of both algorithms can be attributed to the fault regions that the algorithms learn for each fault type, which do not necessarily match the Duval pentagon fault regions, which are geometrically contiguous and do not overlap [17]. In addition, recall that all of these classes, i.e., PD, S, T2-C, and T3-C, are underrepresented in the TDB (see Table 1). In light of these findings, we can conclude that the misclassified samples may lie at the class limits, and/or that the class boundaries found by the algorithms have a different geometric shape than the ones defined by the Duval pentagon. Therefore, increasing the sample size of the underrepresented classes (with either real or synthetic samples) should be useful for improving the boundaries that both algorithms define in the feature space for each class. Finally, it is worth noting that both algorithms classified with 100% accuracy the low thermal faults involving paper carbonization (i.e., T1-C), which is the most underrepresented class in the TDB.
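The row and column reading of the confusion matrix described above can be illustrated with a short Python sketch. The counts below are hypothetical and are not taken from Figure 3; the helper names `per_class_summary` and `missed_fault_rate` are illustrative, rows are assumed to hold the true classes and columns the predicted ones, every class is assumed to have at least one sample, and BA is assumed to denote the usual macro-averaged recall.

```python
from typing import Dict, List

# Hypothetical counts; rows are true classes, columns are predicted classes.
labels = ["Normal", "PD", "S"]
cm = [
    [9, 1, 0],  # true Normal
    [0, 8, 2],  # true PD
    [1, 3, 6],  # true S
]


def per_class_summary(cm: List[List[int]], labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-class recall, false negatives (row-wise) and false positives (column-wise)."""
    summary = {}
    for i, name in enumerate(labels):
        row_total = sum(cm[i])                  # all samples whose true class is `name`
        col_total = sum(row[i] for row in cm)   # all samples predicted as `name`
        summary[name] = {
            "recall": cm[i][i] / row_total,
            "fn": row_total - cm[i][i],         # off-diagonal entries of the row
            "fp": col_total - cm[i][i],         # off-diagonal entries of the column
        }
    return summary


def missed_fault_rate(cm: List[List[int]], labels: List[str], normal: str = "Normal") -> float:
    """Fraction of truly faulty samples that were predicted as the normal condition."""
    n = labels.index(normal)
    faulty_rows = [row for i, row in enumerate(cm) if i != n]
    missed = sum(row[n] for row in faulty_rows)
    return missed / sum(sum(row) for row in faulty_rows)


summary = per_class_summary(cm, labels)
balanced_accuracy = sum(s["recall"] for s in summary.values()) / len(labels)

for name, s in summary.items():
    print(name, s)
print("balanced accuracy:", round(balanced_accuracy, 3))
print("faults predicted as normal:", missed_fault_rate(cm, labels))
```

Reporting the share of faults predicted as a normal condition separately from the per-class recall makes the most expensive error type visible even when the global metrics are high.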

Conclusions
This paper has presented a comprehensive review and comparative analysis of standard machine learning algorithms (such as single and ensemble classification algorithms) and two automatic machine learning (autoML) classifiers for the fault diagnosis of power transformers. The primary objective of this study was to compare the performance of classical ML classification algorithms, which require human-in-the-loop experts for tuning, with two autoML approaches that demand minimal human operation. To achieve this, transformer fault data were collected from the literature, as well as from databases from both Mexican and foreign utilities and test laboratories. Subsequently, the raw data were curated, and faults were validated and assigned using both the Duval pentagon method and expert knowledge. The methodology used for the comparison included (i) several pre-processing steps for feature engineering and data normalization; (ii) different ML approaches (single ML and ensemble algorithms were trained and tuned using a GS-CV by a data scientist, whereas the autoML models were trained and tuned using Bayesian optimization in combination with a random forest regression with zero human intervention); and (iii) several approaches to evaluating algorithm performance, namely global metrics, a Pareto front analysis, and a CM for a detailed look into the types of biases the algorithms suffer from. A key contribution of this work is that, for the first time (to the best of the authors' knowledge), it has defined fault classes using Duval pentagons and severity classes.
Our results showed that the robust auto-Sklearn achieved the best global performance metrics over standard single and ensemble ML algorithms. On the other hand, the PA showed that the vanilla autoML approach performed worse than some single (ANN, SVM) and ensemble (SE, HGB, GP, and RF) ML algorithms. The CM revealed that, while the robust auto-Sklearn algorithm obtained the highest global performance metric values, it misclassified some faults as a normal condition. This type of error can have a very negative impact on power grid performance (blackouts) with high financial costs. The misclassification can be attributed to the imbalanced TDB. Increasing the sample size of the underrepresented classes (with either real or synthetic samples) should be useful for improving the boundaries defined in the feature space for each class. In conclusion, the robust auto-Sklearn model is not only a good off-the-shelf solution for the TFD problem that handles imbalanced datasets, but it also achieved the highest global classification performance scores with minimal tuning effort by a human (i.e., electrical experts carrying out a fault diagnosis). This comparative analysis has extended our comprehension of the ML approaches available for the TFD problem, and it has given a view of how much automation we can expect for a real TFD problem, particularly when fault severity is taken into consideration. In future work, the best models (ensemble SE and robust auto-Sklearn) will be incorporated into a power transformer condition assessment in a maintenance management system. It is expected that a failure classification indicating the most probable defect will help engineers reduce the time needed to find and repair incipient faults, which will help to avoid catastrophic failures and fires.

Figure 1. ML methodology developed for the comparison of single, ensemble, and autoML classifiers for the transformer fault classification problem.

Figure 3. Confusion matrices for (a) the robust auto-Sklearn model and (b) the stacking ensemble algorithm.

Table 1. Transformer fault class distribution.

Table 2. Features derived from dissolved gases.

Table 4. Classifiers' performance attained on the transformer fault detection problem.