2.1. Leakage Magnetic Flux in Transformers
In power transformers, not all the magnetic flux produced by the primary winding links the secondary winding, and vice versa. A portion of the flux lines leaves the core and closes through the air without contributing to energy transfer; this portion is referred to as leakage magnetic flux [12]. This phenomenon depends on factors such as the reluctance of the magnetic circuit and the constructive configuration of the transformer. An inadequate design may increase the leakage flux and affect the accuracy of the transformation ratio [20].
The leakage magnetic flux in a power transformer is a direct consequence of imperfect magnetic coupling between the primary and secondary windings. This leakage magnetic flux is proportional to both the current flowing through the winding and the leakage inductance associated with the physical design of the transformer.
From electromagnetic theory, the inductance relates the current to the flux linkage $\lambda$, which can be expressed as follows:
$$\lambda_{\sigma} = L_{\sigma}\, i \tag{1}$$
where $\lambda_{\sigma}$ represents the leakage flux linkage, $L_{\sigma}$ is the leakage inductance, and $i$ is the instantaneous current flowing in the winding. The flux linkage is related to the magnetic flux through the number of turns of the winding according to $\lambda_{\sigma} = N \Phi_{\sigma}$. Therefore, the leakage magnetic flux can be expressed in Equation (2):
$$\Phi_{\sigma} = \frac{L_{\sigma}\, i}{N} \tag{2}$$
The leakage inductance, which represents the system’s ability to store magnetic energy along these leakage paths, directly depends on the geometric and constructive characteristics of the transformer. It can be calculated using Equation (3):
$$L_{\sigma} = \frac{\mu_0 N^{2} A_{\sigma}}{l_{\sigma}} \tag{3}$$
where $\mu_0$ represents the magnetic permeability of free space, $N$ is the number of turns in the winding, $A_{\sigma}$ is the effective cross-sectional area through which the leakage flux circulates, and $l_{\sigma}$ is the effective length of the leakage flux path [21].
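As a quick numerical illustration of Equations (2) and (3), the following minimal sketch evaluates the leakage inductance and the resulting leakage flux. It assumes the standard forms $L = \mu_0 N^2 A / l$ and $\Phi = L i / N$; the winding dimensions and current are hypothetical values chosen only for the example, not data from this study.

```python
import math

MU_0 = 4e-7 * math.pi  # permeability of free space, H/m


def leakage_inductance(turns, area_m2, path_length_m):
    """Leakage inductance, assuming L = mu_0 * N^2 * A / l."""
    return MU_0 * turns ** 2 * area_m2 / path_length_m


def leakage_flux(inductance_h, current_a, turns):
    """Leakage flux per turn, assuming phi = L * i / N."""
    return inductance_h * current_a / turns


# Hypothetical winding: 200 turns, 4 cm^2 leakage area, 0.25 m leakage path, 5 A
L_sigma = leakage_inductance(200, 4e-4, 0.25)
phi_sigma = leakage_flux(L_sigma, 5.0, 200)
print(f"L_sigma = {L_sigma:.3e} H, phi_sigma = {phi_sigma:.3e} Wb")
```

Because the inductance scales with the square of the number of turns, doubling the turns quadruples the leakage inductance for the same geometry.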
Figure 1 shows the equivalent circuit of a single-phase transformer under no-load conditions with an inter-turn short circuit. In this model, $R_1$ and $X_1$ represent the resistance and leakage reactance of the main winding, while $R_c$ and $X_m$ represent the core-loss resistance and magnetizing reactance. $R_f$ and $L_f$ represent the resistance and inductance associated with the short-circuited turns, and $V$ is the line voltage, allowing modeling of the fault current circulating within the winding itself.
The inter-turn short-circuit fault in a power transformer is a type of internal fault that occurs when the insulation between two or more turns of the same winding is compromised, allowing direct electrical contact between them. This condition leads to the circulation of a localized short-circuit current (fault current) through the shorted turns, resulting in excessive heating, a local increase in leakage magnetic flux, and consequently a significant alteration in the magnetic field distribution inside the transformer [22].
This localized current does not necessarily immediately affect externally measured electrical parameters, such as voltage or line current, which makes early detection difficult using conventional techniques or standard protection devices. However, the cumulative effect of this fault may lead to progressive damage, such as deterioration of nearby insulation, winding deformation due to thermal and mechanical stresses, and even evolution toward more severe faults such as phase-to-ground or phase-to-phase short circuits, with the risk of total transformer failure.
The electrical behavior of a system with an inter-turn short circuit can be modeled using Equations (4) and (5), incorporating the variables associated with the fault. The following equations represent a coupled model of the primary winding and the shorted turns:
$$v = R_1 i + L_{\sigma} \frac{di}{dt} + M \frac{di_f}{dt} \tag{4}$$
$$0 = R_f i_f + L_f \frac{di_f}{dt} + M \frac{di}{dt} \tag{5}$$
where $v$ is the voltage applied to the winding, $i$ represents the main current in the winding, $i_f$ is the inter-turn short-circuit fault current, $L_{\sigma}$ is the leakage inductance of the winding, $L_f$ is the inductance of the short-circuited turns, $M$ is the mutual inductance between the winding and the short-circuited turns, $R_1$ is the winding resistance, and $R_f$ is the resistance of the short-circuited turns.
Equation (5) shows that the inter-turn short circuit does not necessarily introduce an immediate change in the output voltage or total system current, which explains why these faults often go unnoticed in their early stages. However, the fault current $i_f$ generates a localized magnetic flux that may oppose the main magnetic flux, distorting its symmetry and introducing harmonic components or anomalous variations in the leakage flux, which can be detected through appropriate sensors or advanced diagnostic techniques [3].
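To make the coupled model more concrete, the sketch below solves it in sinusoidal steady state using phasors. It assumes the coupled equations take the form $V = (R_1 + j\omega L_{\sigma}) I + j\omega M I_f$ and $0 = j\omega M I + (R_f + j\omega L_f) I_f$; the function name and all parameter values are hypothetical and serve only to illustrate the calculation.

```python
import math


def solve_fault_phasors(v_rms, freq_hz, r_w, l_w, r_f, l_f, m):
    """Steady-state phasor currents of the winding and the shorted turns.

    Assumed model (not taken from measured data):
        V = (R_w + jwL_w) I + jwM I_f
        0 = jwM I + (R_f + jwL_f) I_f
    """
    w = 2 * math.pi * freq_hz
    z_w = complex(r_w, w * l_w)   # winding impedance
    z_f = complex(r_f, w * l_f)   # shorted-turn loop impedance
    z_m = complex(0.0, w * m)     # mutual coupling term
    # The second equation gives I_f = -z_m * I / z_f; substitute into the first.
    i_main = v_rms / (z_w - z_m ** 2 / z_f)
    i_fault = -z_m * i_main / z_f
    return i_main, i_fault


# Hypothetical parameters: 127 V, 60 Hz supply
i_main, i_fault = solve_fault_phasors(
    127.0, 60.0, r_w=2.0, l_w=0.05, r_f=0.05, l_f=1e-4, m=1e-3
)
print(f"|I| = {abs(i_main):.3f} A, |I_f| = {abs(i_fault):.3f} A")
```

Setting `m=0.0` reproduces the healthy case, in which no current circulates in the shorted loop; comparing both cases shows how the line current and the fault current respond to the coupling.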
According to standards such as IEEE C57.12.91-2020, the test code for dry-type distribution and power transformers, this type of fault represents one of the main challenges in the predictive monitoring and preventive diagnostics of power equipment due to its progressive evolution and difficult early-stage detection [23].
2.2. Statistical Time Features Processing
The time-domain signals captured by the magnetic sensor are stored to evaluate changes in the power transformer operating under healthy and faulty conditions. For this purpose, the statistical indicators presented in Table 1 are used, as they allow the extraction of relevant quantitative information from the signals and the generation of trends. To compute them, a windowing process segments each signal into defined intervals, from which features describing the dynamic behavior of the signal are extracted. In Table 1, $x_i$ is the signal in the time domain, for $i = 1, 2, \ldots, N$, with $N$ being the number of data samples in the signal.
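The windowing and feature computation can be sketched as follows. Since Table 1 is not reproduced in this section, the indicators below (mean, standard deviation, RMS, peak, crest factor, kurtosis) are an assumed, commonly used subset rather than the exact set used in the study; the synthetic 60 Hz signal, sampling rate, and window sizes are likewise hypothetical.

```python
import math


def window_features(x):
    """Statistical time-domain indicators of one window (assumed subset)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    return {
        "mean": mean,
        "std": math.sqrt(var),
        "rms": rms,
        "peak": peak,
        "crest_factor": peak / rms if rms > 0 else 0.0,
        "kurtosis": (sum((v - mean) ** 4 for v in x) / n) / var ** 2 if var > 0 else 0.0,
    }


def sliding_windows(signal, size, step):
    """Yield consecutive windows of `size` samples, advancing by `step`."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


# Hypothetical 60 Hz flux signal sampled at 6 kHz for 0.5 s
fs = 6000
signal = [math.sin(2 * math.pi * 60 * k / fs) for k in range(fs // 2)]
features = [window_features(w) for w in sliding_windows(signal, size=600, step=300)]
```

Each element of `features` is one row of the feature matrix; using an overlap (here 50%) is a design choice that trades more training rows for stronger correlation between consecutive rows.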
The processing of large amounts of data for machine learning may generate a complex model, which can lead to overfitting or poor performance in evaluation metrics. Therefore, feature reduction algorithms are implemented to improve data quality and reduce their complexity, which are described below.
2.2.1. Feature Selection
The objective of this stage is to select the subset of input variables that carries the most information and best describes the input data, maximizing relevance and minimizing redundancy. For this case study, the Fisher Score is used. This automatic method ranks each feature so that, in the space of the selected features, the distances between data from different classes are as large as possible, while the distances between data from the same class are as small as possible [24], according to Equation (20):
$$F_j = \frac{\sum_{i=1}^{c} n_i \left(\mu_{ij} - \mu_j\right)^{2}}{\sum_{i=1}^{c} n_i\, \sigma_{ij}^{2}} \tag{20}$$
where $c$ is the number of classes, $n_i$ is the number of data samples in the $i$-th class, $\mu_{ij}$ is the mean and $\sigma_{ij}$ is the standard deviation of the $i$-th class of the $j$-th feature, respectively, and $\mu_j$ is the mean of the entire dataset for the $j$-th feature. After calculating the Fisher Score of each feature, the features with the highest scores are selected, since a high score indicates a good capability to discriminate between classes [25].
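A minimal sketch of the Fisher Score computation is given below. It assumes the usual definition, the class-size-weighted between-class scatter of a feature divided by its weighted within-class variance; the function name and the toy dataset are hypothetical.

```python
def fisher_scores(samples, labels):
    """Fisher Score of every feature (column) in `samples`.

    samples: list of equal-length feature vectors
    labels:  class label of each sample
    """
    n_features = len(samples[0])
    classes = sorted(set(labels))
    scores = []
    for j in range(n_features):
        column = [row[j] for row in samples]
        overall_mean = sum(column) / len(column)
        between, within = 0.0, 0.0
        for c in classes:
            values = [v for v, y in zip(column, labels) if y == c]
            n_c = len(values)
            mean_c = sum(values) / n_c
            var_c = sum((v - mean_c) ** 2 for v in values) / n_c
            between += n_c * (mean_c - overall_mean) ** 2   # class separation
            within += n_c * var_c                           # class compactness
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Toy data: feature 0 separates the classes, feature 1 does not
samples = [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0], [3.0, 4.0], [3.2, 5.0], [2.8, 3.0]]
labels = [0, 0, 0, 1, 1, 1]
scores = fisher_scores(samples, labels)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

Here `ranking` lists the feature indices from most to least discriminative, so keeping the first few entries implements the selection step.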
2.2.2. Feature Extraction
Feature extraction aims to project a high-dimensional dataset into a lower-dimensional subspace while preserving the most relevant information, facilitating the analysis process and improving the efficiency of learning models. There are different techniques, including linear, nonlinear, supervised, and unsupervised methods [26]. In this study, both supervised and unsupervised techniques are employed. The supervised method considered is Linear Discriminant Analysis (LDA), and the unsupervised techniques include Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and Isometric Mapping (ISOMAP).
LDA is a supervised dimensionality reduction technique whose objective is to project the data into a lower-dimensional subspace where the classes are as well separated as possible. This method identifies a new feature space by maximizing the between-class dispersion while minimizing the within-class dispersion. However, since this technique uses class label information to optimize separability, it may modify the original dynamic structure of the data by increasing the global distances between classes and reducing the local dispersion within each class [27]. For this reason, in the present work, LDA is mainly used for representational and comparative purposes with respect to unsupervised dimensionality reduction techniques, which better preserve the intrinsic structure of the data.
PCA is a standard technique in modern data analysis, widely used in various scientific fields. Its objective is to identify a new basis that captures as much variance as possible from the original dataset, revealing hidden structures and reducing noise. This unsupervised technique is useful for tasks such as dimensionality reduction, compression, feature extraction, and data visualization.
PCA performs a linear transformation of the data by generating new variables, called principal components (PCs), which represent the directions of maximum variance in the data. These new dimensions allow the data to be projected into a more compact subspace while preserving the most relevant information. The procedure for applying PCA includes the following steps:
Calculate the covariance matrix of the mean-centered data: $C = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{T}$.
Perform the eigenvalue and eigenvector decomposition of $C$.
Sort the eigenvalues and their corresponding eigenvectors in descending order.
Select a reduced number of dimensions $k$, smaller than the original dimension $d$.
Construct the transformation matrix $W$ with the selected eigenvectors.
Project each vector $x$ from the original space of dimension $d$ into the new space of dimension $k$ using $y = W^{T} x$ [28].
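The PCA steps listed above can be sketched with NumPy as follows. This is an illustrative implementation via eigendecomposition of the sample covariance matrix; the function name and the synthetic data are assumptions made for the example.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X (n samples x d features) onto the top-k principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)      # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # sort in descending order
    W = eigvecs[:, order[:k]]                   # d x k transformation matrix
    explained = eigvals[order[:k]] / eigvals.sum()
    return X_centered @ W, W, explained


rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 features, most variance along the first axis
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
Y, W, explained = pca_project(X, k=2)
print(Y.shape, explained)
```

The `explained` vector gives the fraction of total variance captured by each retained component, which is a common criterion for choosing the reduced dimension.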
UMAP is an unsupervised, nonlinear, manifold-based dimensionality reduction technique. Unlike methods such as PCA that prioritize the preservation of global distances, UMAP seeks to preserve both the local and global structure of the dataset. It is useful for both visualization and preprocessing in machine learning tasks. UMAP is based on three key principles:
Local approximation of the space using neighbor graphs.
Modeling connectivity through fuzzy simplicial sets.
Optimization of the reduced space to preserve the graph structure.
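The local weighted distance is calculated using Equation (21):
$$w_{j|i} = \exp\left(-\frac{\max\left(0,\; d(x_i, x_j) - \rho_i\right)}{\sigma_i}\right) \tag{21}$$
where $d(x_i, x_j)$ is the distance between points $x_i$ and $x_j$, $\rho_i$ is the distance to the nearest neighbor of $x_i$, and $\sigma_i$ is the adaptive parameter that controls the local scale.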
Symmetric connectivity is then obtained from Equation (22):
$$w_{ij} = w_{j|i} + w_{i|j} - w_{j|i}\, w_{i|j} \tag{22}$$
where $w_{ij}$ represents the symmetric edge weight between points $x_i$ and $x_j$ in the fuzzy simplicial graph, combining the mutual membership strengths $w_{j|i}$ and $w_{i|j}$.
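Finally, the loss function (based on the cross-entropy divergence between high- and low-dimensional graphs) is represented by Equation (23):
$$\mathcal{L} = \sum_{i \neq j} \left[ w_{ij} \log\frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log\frac{1 - w_{ij}}{1 - v_{ij}} \right], \qquad v_{ij} = \left(1 + a \lVert y_i - y_j \rVert^{2b}\right)^{-1} \tag{23}$$
where $y_i$ and $y_j$ are the low-dimensional representations of the original data points $x_i$ and $x_j$, respectively; $\lVert y_i - y_j \rVert$ is the Euclidean distance in the embedded space; and $a$ and $b$ are empirical parameters that define the shape of the distribution used to model similarities in the low-dimensional space. The term $v_{ij}$ represents the similarity between points $i$ and $j$ in the reduced space [29].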
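The graph-construction part of UMAP can be illustrated with the simplified sketch below. It is not the reference implementation: it assumes a single fixed local scale `sigma` for all points (UMAP tunes one value per point), uses brute-force neighbor search, and takes `a = b = 1` as placeholder shape parameters. All function names are hypothetical.

```python
import numpy as np


def fuzzy_graph(X, k, sigma):
    """Symmetric fuzzy edge weights built from k-nearest-neighbor distances.

    Simplification: `sigma` is shared by all points, whereas UMAP adapts it per point.
    Assumes no duplicate points, so each point is its own closest match.
    """
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]    # skip the point itself
        rho = dist[i, neighbors[0]]                 # distance to the nearest neighbor
        W[i, neighbors] = np.exp(-np.maximum(0.0, dist[i, neighbors] - rho) / sigma)
    return W + W.T - W * W.T                        # fuzzy union of directed weights


def low_dim_similarity(Y, a=1.0, b=1.0):
    """Similarity (1 + a * d^(2b))^-1 between embedded points."""
    d = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return 1.0 / (1.0 + a * d ** (2 * b))


rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))        # hypothetical feature vectors
W = fuzzy_graph(X, k=5, sigma=1.0)
V = low_dim_similarity(X[:, :2])    # similarity of a naive 2-D projection
```

In the full algorithm, the embedding coordinates are adjusted by stochastic gradient descent so that `V` approaches `W` in the cross-entropy sense.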
ISOMAP is an unsupervised, nonlinear, manifold-based method for dimensionality reduction. Unlike linear techniques such as PCA, which preserve only Euclidean distances, ISOMAP maintains geodesic distances between points, allowing the proper representation of nonlinear structures in the data.
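The method constructs a neighborhood graph by connecting each point $x_i$ with its nearest neighbors. Then, it computes the approximate geodesic distances between all pairs of data points using shortest-path algorithms such as Floyd-Warshall or Dijkstra, generating the geodesic distance matrix $D_G$ [30]. This matrix is transformed using Equation (24):
$$\tau(D_G) = -\frac{1}{2} H D_G^{(2)} H \tag{24}$$
where $D_G^{(2)}$ is the matrix of element-wise squared geodesic distances and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$ is the centering matrix. Finally, Multidimensional Scaling (MDS) is applied to $\tau(D_G)$, obtaining the eigenvectors and eigenvalues to project the data into a lower-dimensional space. The algorithm therefore consists of three steps: constructing the neighborhood graph, computing the geodesic distance matrix, and applying MDS to the centered matrix.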
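The ISOMAP procedure described above can be sketched in a few lines of NumPy. This is a didactic version assuming a brute-force k-nearest-neighbor graph and the Floyd-Warshall algorithm for shortest paths; the function name and the arc-shaped test data are hypothetical.

```python
import numpy as np


def isomap(X, n_neighbors, n_components):
    """Minimal ISOMAP: k-NN graph, Floyd-Warshall geodesics, classical MDS."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # 1. Neighborhood graph: keep only edges to the nearest neighbors.
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    for i in range(n):
        nearest = np.argsort(dist[i])[1:n_neighbors + 1]
        graph[i, nearest] = dist[i, nearest]
        graph[nearest, i] = dist[i, nearest]        # keep the graph undirected
    # 2. Geodesic distances (Floyd-Warshall, one intermediate node per pass).
    for m in range(n):
        graph = np.minimum(graph, graph[:, [m]] + graph[[m], :])
    if np.isinf(graph).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")
    # 3. Classical MDS on the double-centered squared distances.
    H = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * H @ (graph ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


# Hypothetical data: points on a half circle, a 1-D manifold embedded in 2-D
t = np.linspace(0.0, np.pi, 30)
X_arc = np.column_stack([np.cos(t), np.sin(t)])
Y_arc = isomap(X_arc, n_neighbors=2, n_components=1)
```

For this example, the one-dimensional embedding should vary almost linearly with the arc parameter `t`, which is the behavior that distinguishes geodesic from straight-line distances.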
2.2.3. Quality Metrics for Dimensionality Reduction Techniques
Dimensionality reduction techniques, when projecting data into a lower-dimensional subspace, inevitably involve a loss of information. Therefore, it is necessary to evaluate how faithfully the original relationships of the dataset are preserved [31]. For this purpose, local and global dimensionality reduction quality metrics are used, as described below.
Trustworthiness is a metric that quantifies how well a dimensionality reduction technique preserves the local structure of the original space; that is, whether points that are close neighbors in the reduced space were also close in the high-dimensional space. This metric focuses on penalizing spurious neighbors, namely points that appear as neighbors in the projection but were not close in the original structure. A visualization is considered trustworthy if it introduces the smallest possible number of such false neighbors.
Formally, let $X$ be the dataset in the original space and $Y$ its projection into the reduced space. For point $i$, the $k$ nearest neighbors in both spaces are defined as follows: $N_k^{X}(i)$ is the set of the $k$ nearest neighbors of $i$ in the original space, and $N_k^{Y}(i)$ is the set of the $k$ nearest neighbors of $i$ in the projected space.
These two sets define the neighbors that appear in the projection but not in the original space, obtained as $U_k(i) = N_k^{Y}(i) \setminus N_k^{X}(i)$. The global formula to measure the local distortion using Trustworthiness is given by Equation (25):
$$T(k) = 1 - \frac{2}{n k (2n - 3k - 1)} \sum_{i=1}^{n} \sum_{j \in U_k(i)} \left( r(i, j) - k \right) \tag{25}$$
where $n$ is the total number of points in the dataset, $r(i, j)$ is the rank of point $j$ as a neighbor of $i$ in the original space, and $k$ is the number of neighbors considered, typically between 5 and 15.
Trustworthiness values are interpreted as follows:
Trustworthiness > 0.9: excellent preservation; local relationships are adequately maintained.
0.8 < Trustworthiness ≤ 0.9: good representation.
Trustworthiness ≤ 0.8: possible significant distortion [32].
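A minimal sketch of the Trustworthiness computation is shown below. It assumes Euclidean distances, brute-force neighbor search, and the standard normalization, which is valid for $k < n/2$; the function name and the random data are hypothetical.

```python
import numpy as np


def trustworthiness(X, Y, k):
    """Trustworthiness of embedding Y with respect to the original data X."""
    n = len(X)
    d_high = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_low = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    np.fill_diagonal(d_high, -1.0)      # force each point to rank first in its own list
    np.fill_diagonal(d_low, -1.0)
    # rank_high[i, j]: rank of j as a neighbor of i in the original space (self = 0)
    rank_high = np.argsort(np.argsort(d_high, axis=1), axis=1)
    nn_low = np.argsort(d_low, axis=1)[:, 1:k + 1]   # k nearest neighbors in the projection
    penalty = 0
    for i in range(n):
        # Only false neighbors (original rank greater than k) are penalized.
        penalty += np.maximum(rank_high[i, nn_low[i]] - k, 0).sum()
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


rng = np.random.default_rng(2)
X_hd = rng.normal(size=(40, 5))                           # hypothetical original data
t_identity = trustworthiness(X_hd, X_hd, k=5)             # perfect embedding
t_projection = trustworthiness(X_hd, X_hd[:, :2], k=5)    # naive 2-D projection
```

An embedding identical to the original data introduces no false neighbors and therefore scores exactly one, while discarding coordinates lowers the score.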
Spearman’s correlation evaluates the preservation of the global structure by comparing the pairwise distances in the original space with those in the reduced space. It is based on the ranks of these distances (rather than their absolute values), making it less sensitive to scale and more focused on preserving the relative order among distances. It is obtained from Equation (26):
$$\rho_s = \frac{\operatorname{cov}\left(R(d^{X}), R(d^{Y})\right)}{\sigma_{R(d^{X})}\, \sigma_{R(d^{Y})}} \tag{26}$$
where $R(d^{X})$ and $R(d^{Y})$ are the ranks of the pairwise distances in the original and reduced spaces, respectively.
The negative of this correlation is used as a loss function: values of $\rho_s$ closer to one indicate greater preservation of the global structure, since the distance ranks between pairs remain consistent in both spaces.
Since ordinary ranks are not differentiable (which prevents their direct use in gradient-optimized methods), a technique called soft ranking is employed. This allows a smooth approximation of traditional ranks and makes it possible to compute a differentiable version of $\rho_s$. The parameters to evaluate dimensionality reduction quality are defined as follows:
This metric is particularly useful when preserving the relative order between points is more important than maintaining exact distances [33].
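The rank-based comparison of distances can be sketched as follows. Note that this example uses ordinary (hard) ranks with ties broken by position, not the differentiable soft ranking mentioned above, so it is suitable for evaluation only; the function names and the toy points are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def ordinal_ranks(values):
    """Ranks 0..m-1; ties are broken by position (no rank averaging)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman_distance_correlation(X, Y):
    """Spearman correlation between the pairwise distances of X and of Y."""
    return pearson(ordinal_ranks(pairwise_distances(X)),
                   ordinal_ranks(pairwise_distances(Y)))


# Toy example: a uniformly scaled copy preserves every distance rank
X_pts = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (0, 0, 3), (2, 1, 1)]
Y_scaled = [tuple(2 * c for c in p) for p in X_pts]
Y_proj = [p[:2] for p in X_pts]          # drop the third coordinate
rho_scaled = spearman_distance_correlation(X_pts, Y_scaled)
rho_proj = spearman_distance_correlation(X_pts, Y_proj)
```

Uniform scaling leaves the order of all distances unchanged, which illustrates why the metric is insensitive to scale, whereas dropping a coordinate can reorder them.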
Kruskal Stress, also known as Stress-1, is one of the most widely used metrics for evaluating the preservation of global structure. Its function is to quantify the discrepancy between the original dissimilarities and the distances represented in the reduced space, serving as a direct indicator of the quality of fit by evaluating global distances. Kruskal’s stress formula is defined by Equation (27):
$$\text{Stress-1} = \sqrt{\frac{\sum_{i<j} w_{ij} \left( \hat{d}_{ij} - d_{ij} \right)^{2}}{\sum_{i<j} w_{ij}\, d_{ij}^{2}}} \tag{27}$$
where $\hat{d}_{ij}$ represents the transformed dissimilarities or disparities, $d_{ij}$ are the distances between points $i$ and $j$ in the reduced space of dimension $p$, and $w_{ij}$ are optional weights, equal to one if the pair is present and zero if data are missing [34].
To evaluate the quality of fit, Kruskal (1964) [35] proposed the following qualitative scale for interpreting Stress-1 values:
Stress-1 > 0.20: Very poor
Stress-1 ≈ 0.10: Acceptable
Stress-1 ≈ 0.05: Good
Stress-1 ≈ 0.025: Excellent
Stress-1 = 0: Perfect
However, these rules should be used with caution. The stress value may be influenced by several factors, such as the following:
Number of objects $n$: As the number of data points increases, stress is expected to increase.
Number of dimensions $p$: As the number of dimensions increases, stress tends to decrease.
Level of noise or error in the original dissimilarities.
Ties in the data (if ordinal transformations are used).
Presence of missing values, which often artificially reduce stress [36].
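The Stress-1 computation can be illustrated with the short sketch below. It assumes the disparities are supplied directly (for metric scaling these are simply the original dissimilarities) and that the optional weights default to one; the function name and the numbers are hypothetical.

```python
import math


def kruskal_stress1(disparities, distances, weights=None):
    """Kruskal's Stress-1 between disparities and embedded distances.

    Each argument is a flat list with one entry per point pair (i < j).
    """
    if weights is None:
        weights = [1.0] * len(distances)     # every pair present
    num = sum(w * (dh - d) ** 2 for w, dh, d in zip(weights, disparities, distances))
    den = sum(w * d ** 2 for w, d in zip(weights, distances))
    return math.sqrt(num / den)


# Three point pairs; the last embedded distance is too short by one unit
stress = kruskal_stress1([1.0, 2.0, 3.0], [1.0, 2.0, 2.0])
# Treating the mismatched pair as missing (weight zero) removes its contribution
stress_masked = kruskal_stress1([1.0, 2.0, 3.0], [1.0, 2.0, 2.0], weights=[1, 1, 0])
```

The second call shows the effect noted above: marking pairs as missing can artificially reduce the reported stress.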
2.3. Machine Learning Processing
Support vector machines (SVMs) have been established as an effective tool in classification problems, especially in contexts involving small, high-dimensional, and nonlinear datasets. Their popularity in fault diagnosis is due to their ability to find optimal decision boundaries, even in scenarios where the data are not linearly separable [9].
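In essence, an SVM seeks the optimal hyperplane that maximizes the margin between two classes. This hyperplane is defined by Equation (28):
$$w^{T} x + b = 0 \tag{28}$$
where $w$ is the weight vector, $x$ is the input vector, and $b$ is the bias. The objective is that the training samples satisfy the conditions of Equations (29) and (30), with equality holding for the support vectors:
$$w^{T} x_i + b \geq +1 \quad \text{for } x_i \in \Omega_1 \tag{29}$$
$$w^{T} x_i + b \leq -1 \quad \text{for } x_i \in \Omega_2 \tag{30}$$
where $\Omega_1$ and $\Omega_2$ represent the two linearly separable classes in the feature space.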
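The training of an SVM consists of solving the following optimization problem:
$$\min_{w,\, b} \; \frac{1}{2} \lVert w \rVert^{2} \tag{31}$$
It is subject to the following:
$$y_i \left( w^{T} x_i + b \right) \geq 1, \quad i = 1, \ldots, n \tag{32}$$
where $\lVert w \rVert$ denotes the Euclidean norm of the weight vector, $x_i$ is the $i$-th training sample, $y_i \in \{-1, +1\}$ is the corresponding class label of each sample $x_i$, and $n$ is the total number of training samples.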
The solution to the problem leads to a linear combination of the samples closest to the hyperplane, known as support vectors, defined by Equation (33):
$$w = \sum_{i=1}^{n} \alpha_i y_i x_i \tag{33}$$
where $\alpha_i$ are the Lagrange multipliers obtained from the dual optimization problem. Only samples with $\alpha_i > 0$ correspond to support vectors and contribute to the final classifier.
For nonlinear problems, a kernel function is used to transform the data into a higher-dimensional space where they become separable. The resulting model is expressed according to Equation (34):
$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b \right) \tag{34}$$
One of the most common kernel functions is the radial basis function (RBF), defined by Equation (35):
$$K(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}} \right) \tag{35}$$
In this expression, $\sigma$ is referred to as the kernel scale and determines how far the influence of an individual sample extends. Small values of $\sigma$ generate more complex models that may overfit the data; large values may lead to underfitting. Therefore, the choice of $\sigma$ has a direct impact on the model’s ability to generalize. In addition, a penalty parameter $C$ is introduced, which controls the trade-off between maximizing the margin and allowing classification errors through slack variables. A high value of $C$ strongly penalizes errors, which may lead to overfitting; a low value allows more training errors but tends to improve generalization [37].
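The kernel decision function can be sketched as follows. The example assumes the Gaussian kernel is parameterized by a kernel scale `sigma`, as $\exp(-\lVert x - z \rVert^2 / (2\sigma^2))$, which matches the behavior described above (small scale, more complex model). The support vectors, multipliers, and bias are hypothetical values, not the result of an actual training run.

```python
import math


def rbf_kernel(x, z, sigma):
    """Gaussian (RBF) kernel with kernel scale sigma (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decision_function(x, support_vectors, alphas, labels, bias, sigma):
    """Weighted sum of kernel evaluations over the support vectors plus the bias."""
    return sum(
        a * y * rbf_kernel(sv, x, sigma)
        for sv, a, y in zip(support_vectors, alphas, labels)
    ) + bias


def predict(x, support_vectors, alphas, labels, bias, sigma):
    """Class label (+1 or -1) from the sign of the decision function."""
    score = decision_function(x, support_vectors, alphas, labels, bias, sigma)
    return 1 if score >= 0 else -1


# Hypothetical trained model with one support vector per class
svs = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
label_a = predict((0.1, 0.2), svs, alphas, labels, bias=0.0, sigma=1.0)
label_b = predict((1.9, 2.1), svs, alphas, labels, bias=0.0, sigma=1.0)

# A smaller kernel scale shrinks the region influenced by each support vector
narrow = rbf_kernel((0.0, 0.0), (1.0, 0.0), sigma=0.1)
wide = rbf_kernel((0.0, 0.0), (1.0, 0.0), sigma=10.0)
```

Comparing `narrow` and `wide` shows why a small kernel scale produces highly localized, more complex decision boundaries, while a large one smooths them.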
The proper selection of parameters $\sigma$ and $C$ is essential to achieve high classification performance. Both parameters must be carefully tuned, typically through cross-validation or optimization algorithms such as genetic algorithms (GAs), which are used in this case study [38].
Figure 2 presents a schematic representation of the general architecture of an SVM, where the transformation of input data into a higher-dimensional feature space through the kernel, the construction of the optimal hyperplane, and the resulting classification based on the decision function can be observed.