Prediction of Stress in Power Transformer Winding Conductors Using Artiﬁcial Neural Networks: Hyperparameter Analysis

: The purpose of this research is the evaluation of artiﬁcial neural network models in the prediction of stresses in a 400 MVA power transformer winding conductor caused by the circulation of fault currents. The models were compared considering the training, validation, and test data errors’ behavior. Different combinations of hyperparameters were analyzed based on the variation of architectures, optimizers, and activation functions. The data for the process was created from ﬁnite element simulations performed in the FEMM software. The design of the Artiﬁcial Neural Network was performed using the Keras framework. As a result, a model with one hidden layer was the best suited architecture for the problem at hand, with the optimizer Adam and the activation function ReLU. The ﬁnal Artiﬁcial Neural Network model predictions were compared with the Finite Element Method results, showing good agreement but with a much shorter solution time.


Introduction
Electromagnetic forces and stresses in conductors are usually determined with the Finite Element Method (FEM) [1][2][3][4][5]. As a numerical method, FEM needs the discretization of the medium, which results in the creation of nodes, each one represented as one row and one column in a matrix [6]. This process is not unique because FEM internally looks for the discretization that presents a slight field variation between nodes [7]. This is achieved through an iterative process, which lasts longer when the parameters of the problem have a nonlinear behavior-such as the permeability of the transformer core.
As a consequence, FEM could spend excessive time in obtaining one solution. When the transformer design requires the electromagnetic forces, this is not a problem because just a small number of calculations need to be performed. For example, if the procedure of evaluation of design recommended in [8] is followed, only the worst force and stress must be calculated, which implies one FEM simulation. Even if the dynamical analysis of the same reference is considered, 250 simulations should be performed at most, which correspond to 2 s at 60 Hz, at the highest value of short circuit current and the most pessimistic transient conditions.
However, if the forces must be calculated multiple times, FEM could become cumbersome. For that reason, finding a method that quickly determines the stress in winding conductors of a power transformer, such as an artificial neural network (ANN), may be very useful.
The interaction between artificial intelligence and electromagnetic forces is not widely studied in the research community.
Some investigations are focused on the analysis of possible damage inside the equipment. For example, in [9], ANNs were applied to classify the condition of power transformers as normal, degraded, or anomalous. The novelty of the research was the identification of nonlinearities in the response of the vibrations before electrical load variations. In the experimental setup, the vibrations due to electromagnetic forces in windings were separated from those due to magnetostriction in the core. The ANN classification results were compared with those of a Naive Bayesian Classifier, giving better accuracy.
Other related investigations are not focused on forces but on the determination of the magnetic field. In [10], magnetic fields were calculated with the use of Convolutional Neural Networks (CNN). In that research, a transformer with variable dimensions of the core and windings was included in the analysis. The authors used pictures of the magnetic field distribution as input for training. Each pixel was an input for the CNN. The training process was of the supervised kind. The authors used dropout layers to increase the generalization of the network. The dropout neurons were probabilistically chosen through a Bayesian Monte Carlo Technique. This same technique was used to detect if the problem was not suitable to be modeled with a CNN. An important feature of the research is the use of dilated filters, which fit well with the behavior of the magnetic field, which is influenced by the surrounding fields. The main limitation of the research is that the electrical current was not varied when creating the training data. This restriction limits the application of the CNN model when the goal is its use in practical problems. The study is limited to only one current, which could become a restriction when the fields and the forces are required for different transformer operating conditions.
A similar approach to using CNN-treating the magnetic field like a picture-was used in [11]. In this case, the problem is the determination of electromagnetic scattering when there is an incident wave. Finite differences in the frequency domain were used to generate the training data corresponding to wave and scattering pictures. The novelty of the paper is the use of deep residual networks to improve the network accuracy despite the number of hidden layers. The main difference with the problem of the fields in power transformers is the characteristics of the internal components' dimensions. While the core has lengths in the order of one meter, the winding conductors or disks could be as thin as millimeters. This wide range of dimension values brings difficulties in treating the pictures required to use this technique.
This research presents an ANN design to determine the stress in the conductors of a 400 MVA, three-phase power transformer produced by the circulation of three-phase fault currents.
FEM simulations are used to train the ANN, where a set of possible fault currents is the input data, and the corresponding stresses in the conductors are the output data. As discussed, each FEM simulation takes a long time. Thus, if the objective is to monitor the effects of the stress on the power transformer in real-time, the FEM simulations would not be the most appropriate option. On the other hand, when developing the ANN model, the long FEM simulation time is needed just during the data collection. After that, the ANN model can predict the stress due to any electrical current inside the range that was established during the creation of the model. As a consequence, there would be a handy ANN model fast enough to get results in real-time, and that could be used during the transformer's lifetime.
In regards to the design of the ANN model an objective is to find the best combination of alternatives based on three characteristics: architecture, activation functions, and optimizers used in the ANN model. The best combination is assessed by checking three attributes: the error for the training data, the generalization with the validation data, and the convergence time of the process.

Artificial Neural Networks Characteristics
Two concepts are commonly used in ANN literature: parameters and hyperparameters. Parameters change during the training process, i.e., they are controlled by the computational program-for example, the weights on each connection. Hyperparameters are defined outside the training process and usually are predefined, for example, the number of ANN layers [12]. Hence, the designer's task is to find the appropriate hyperparameters to get the lowest difference between the actual and the calculated value.
The difference between the calculated output valueŷ and the actual value y is the error function of the model L, determined in Equation (1). This function is minimized during the training process.
The objective of the ANN model design is to obtain the minimum error. The hyperparameters that the designer can change are the architecture, the optimizer, and the activation function. The architecture includes the number of layers and the number of neurons on each layer. The optimizer is the method to minimize the error function. The activation function introduces nonlinearities in the model.

Optimizers
Each iteration updates the weights to reach a point as close as possible to the error function's minimum. This optimization problem is multidimensional (one dimension per weight); hence, the best direction of change for the weights is given by the function's gradient to be optimized, i.e., the error function. This research has compared four optimizers: SGD, SGD with Momentum (SGDmom), RMSprop, Adam, whose implementations are shown from Equation (2) to Equation (5). α is the learning rate, η is the momentum, and v is the velocity of change. ρ, , m, β 1 , and β 2 are factors according to each optimizer.

Activation Functions
The activation functions that are part of the analysis in the present research are • Leaky ReLU [16,17] • PReLU [18]: The factor a is determined adaptively during training [19].

Power Transformer Characteristics
The transformer under analysis is a two-winding, three-phase power transformer. Table 1 presents its characteristics, Table 2 indicates the dimensions, and Figure 1a shows its internal structure. The windings are of a disk type. The transformer core material is US Steel Type 2-S, 0.018 inch thickness, whose saturation curve is represented in Figure 1b. This material is part of the FEMM software library [20].

Currents in Windings
The currents considered to calculate the conductors' stress are those of three-phase faults seen by the transformer. The range goes from fault currents on the transformer terminals to fault currents with values lower than the rated current.
The peak values are taken from the transient currents. For simplicity, the origin of the fault is located when one phase current (A, B, or C) crosses through zero. In Equation (10), the general form to calculate the currents in the three phases of the low-voltage windings is shown [21]. The current previous to the fault has not been considered when generating the training samples; however, in the final results, it will be seen that the model is accurate even for preloaded cases. Table 3 shows the correspondence between the phase currents and those of Equation (10), according to the instant of the fault. Table 3. Correspondence between the phase currents and Equation (10).

Phase Crossing the Zero Current
I LV is the maximum fault current when it has reached the steady-state. It depends on the impedance Z seen by the transformer during the fault-i.e., on the resistance R and the inductive reactance X L -as shown in Equation (11), where V LV is the peak line-to-neutral voltage of the low-voltage winding (112.7 kV for this study).
where ω is the angular frequency (2 · π · 60 for a 60 Hz power frequency). θ is the angle to control the fault's time; one phase will see the fault at zero seconds or zero radians, whereas the other two phases will see the fault at ±120 • , depending on the phase. φ is the angle of the impedance Z, i.e., cos φ = R/Z. λ is the time constant of the electromagnetic transient, i.e., λ = R/L, where L is the inductance of the inductive reactance X L .
The transformer's impedance is the lowest limit, which, in the generation of input data, represents the current at the transformer's terminals. The leakage reactance is 33.61%, which is 16 Ω in the transformer base. The highest limit has been set until a reactance of 80 Ω, as Table 4 shows. Table 4. Range of impedances used in the calculation of the phase currents.

Electromagnetic Forces
Electromagnetic forces act on the conductors of a winding when the current interacts with the surrounding magnetic field [22]. In mathematical form, this is expressed in Equation (12); f is the force per volume, J is the current density, and B is the magnetic induction [23].
Therefore, the main task is to find the magnetic induction B since J is part of the input data. From Electromagnetic Theory, B can be obtained by first calculating the magnetic vector potential A in Equation (13) (Laplace equation for magnetics) and then B in Equation (14) [24]. However, this process requires the treatment of partial differential equations (PDE), which usually do not have analytical solutions unless the problem has a simple geometry [25]. That is not the case for power transformers, which, on the contrary, have a complex geometry [26].
FEM is a numerical method used to solve partial differential equations that have the form of Poisson or Laplace equations. Hence, the method can be used to find the magnetic field B in power transformers. FEM discretizes the medium where the field must be determined. Like any partial differential equation, Equation (13) also needs a boundary condition to have a unique solution. This condition is considered inside FEM and must be set during the simulation. FEM simulations for power transformers use the Dirichlet boundary condition, which sets a null value for A in some predetermined border. The ideal case would be to have a null vector magnetic potential at infinity because it works as a reference for the rest of the system. As it is not physically possible to have this condition, an appropriate border is chosen by following the convergence criteria for computational methods [27], which recommends performing simulations until no high variation is seen in the values of the potential inside the power transformer. Moreover, the boundary conditions have been located symmetrically around the equipment to have a similar influence in the internal fields. Figure 2 presents the discretization of the 400 MVA power transformer under analysis, generated by the software FEMM. Note that around the disks, the discretization is denser. This characteristic of the triangulation process allows finding a more detailed field where it is suspected that the magnetic induction will have a higher change.  Figure 3 shows the magnetic induction when the transformer is working under rated conditions. The current on Phase A is at its peak. The magnetic induction is the highest in the first limb, which belongs to Phase A. The same behavior is seen in the flux lines. Note that some lines cross the disks of the transformer. They represent the magnetic flux that, together with the current density, will produce a force in the disks.  For this same case, Figure 4 shows the radial and axial forces on the low-voltage winding. Disk 1 is the highest of the winding, and Disk 100 is the lowest. The radial forces are all positive, which means they are directed towards the core. The maximum radial force is located around the middle of the winding. The axial forces are negative in the highest disk and positive in the lowest disk. This characteristic means that they are directed towards the middle of the winding.  In the high-voltage winding (see Figure 5), the radial forces are negative. They are directed away from the core. The axial forces have similar behavior to the low-voltage radial forces, which means that they point towards the middle of the winding.

Winding Conductors Stress
The stresses in the winding conductors are the outputs of the ANN. For the sake of simplicity, the ANN design considers only the middle winding conductors, which have the highest value of stress and force. Thus, for the low-voltage winding, the stress is stored from disk 48 to disk 51; for the high-voltage winding, it is held from disk 49 to disk 51.
In addition, to reduce the simulation time, each disk of the winding is taken as one conductor. This simplification in the model is possible because the stress is defined as a per area magnitude, which does not have a high difference among the disk conductors when there is no considerable variation in the force per volume. The procedure developed in this research has taken the middle conductor of the disk to analyze the average value of stress.
The model considers each disk as a ring. Only the radial force is counted towards calculating the stress because it is the only one that can have a value without causing displacement of the ring. The axial force is nullified in the winding structure, either by another disk or by the winding supports; if that were not the case, the disk would have vertical movements [28]. Figure 6 shows a ring subject to a radial force. The wire radius is much lower than the radius R. The radial force f r causes P, which is perpendicular to the wire section. Due to the force P, there is a stress σ acting on the wire. In that way, the stress σ is calculated according to Equation (15) [29]. The force f r is the radial component of the force per unit length.
R P P f r Figure 6. Ring subject to the action of a radial force f and a force P perpendicular to its section.
The following example uses a case with R = 1 Ω and X L = 15 Ω to explain the approach to find the conductors' stress. Table 5 presents the values used in Equation (10) to calculate the current as a function of time. Figure 7a shows the three phases' current shape for a fault starting when the Phase A current is zero.  The current and the transformer internal geometry are the input data for the FEM simulations. The output of the FEM simulations is the magnetic field. Then, the stress is calculated as described before. Figure 7b shows the stress for the example.

ANN Design
The input data of the ANN are the currents circulating through the windings and the output data are the stresses on the conductors, as shown in Figure 8.

Input Layer Hidden Layers Output Layer
High and Low Voltage Winding Currents SGD models are developed to obtain a set of possible architectures. The learning rate used during the process is 0.1 because a value less than that would take too much time and too many epochs to converge. The following architectures had the best performance and are part of the next design steps.

Stress in the Winding's Middle Disks Conductors
One The same quartiles of SGDmom are used to tune epsilon in Adam. The lowest error appears for the epsilon 10 −4 . Table 6c confirms this result.

Epsilon
Mean value 10 −7 6.60 · 10 −6 10 −6 5.91 · 10 −6 10 −5 4.14 · 10 −6 10 −4 3.11 · 10 −6 10 −3 3.80 · 10 −6 Table 7 summarizes the hyperparameters used in the process for each optimizer. RMSprop and Adam have a lower learning rate than SGD and SGDmom because they are faster to converge.  Figure 9 shows the training time for each optimizer. By far, SGD needs the longest time to converge, even with a higher learning rate. It is about forty times greater than RMSprop and Adam times.  In Figure 10a, the performance for the optimizers analyzed in this research is presented for a patience 10. RMSprop and Adam have the lowest error; hence, they are chosen for the analysis with patience 40, which is shown in Figure 10b. Both optimizers have similar errors, though Adam has a slightly lower value; therefore, it is chosen for the ongoing analysis.

Choosing the Activation Function
Similar to the procedure to get the optimizers' hyperparameters, the first, second, and third quartiles obtained in the Adam process with patience 40 are used to analyze the activation functions. The related architectures are [251 32], [251 1000], and 126, respectively. Table 8a shows the behavior of the error considering the variation of Alpha for Leaky ReLU. Alpha 0.1 presents the lowest error and is used in the detailed analysis. Table 8b shows the behavior of loss for alpha with the ELU activation function. The value of alpha 0.3 presents the lowest error.  Figure 11 presents the loss distribution for the four activation functions: ReLU, Leaky ReLU, ELU, and PReLU. The behavior is similar for all of them. ReLU and Leaky ReLU have the lowest losses. ReLU is the activation function chosen because of its natural simplicity. To conclude, the best suited ANN has the following characteristics: • Optimizer-Adam; • Activation Function-ReLU.

Choosing the Final Architecture
Until this point, the design procedure has considered all the architectures described at the beginning of Section 4. Once the optimizer and the activation function have been settled, along with the related hyperparameters, the architectures included in the first quartile are used to continue with the design: One  Figure 12 shows the error for each architecture. The best four, one-hidden-layer architectures have a better performance than any two-hidden-layer architectures. In addition, they have a similar value of error. The architecture with one hidden layer and 1995 neurons is chosen for the simplicity of the model. According to the training processes, the model does not require additional tuning since the error will not decrease significantly.   Figure 13 presents the process of training the final design with a patience 100. The error consistently reduces until the number of epochs is close to the thousands. Then, the error is almost constant. There is a noise, which means that the algorithm has reached a minimum and is skipping from one point to another. The final design is validated with data that were not used for training nor testing during the design process. The error with the validation data was 7.8136 · 10 −8 . The validation data has 795 samples from different values of resistance and reactance combinations.

Results in Actual Measurement Units
The model is applied to the transformer where the input data are the current through the windings in amperes, and the output data are the stresses on the winding's middle disks in Pascals. For the simulated cases, the stress is calculated at the points where the currents (hence, the forces and the stresses) have an extreme value. Figure 14 illustrates this point for one cycle. For Phase A, the stress is calculated in π and in 2π. For Phase B, the stress is calculated in 2π/3 and in 5π/3. Finally, for Phase C, the stress is calculated in π/3 and in 4π/3. The process is repeated for the rest of the analyzed time.
All of the cases simulate a fault that begins when Phase A is crossing through zero, and the power transformer has a prefault rated current. The first case simulates a fault with resistance R = 1 Ω and a reactance X L = 80 Ω. This impedance represents a fault located far away from the transformer's terminals. Figure 15 shows the results for FEM and ANN simulations. The high X L /R ratio extends the transient of the phenomenon. Thus, the cycles with high stress do not touch the cycles with low stress within the fifty cycle simulations. The results for FEM and ANN simulations are very close. The second event is a fault with a resistance R = 1 Ω and a reactance X L = 47 Ω (see Figure 16). The impedance is lower than in the first case, which causes higher stress on the conductors. The resistance/reactance ratio is also lower; hence, the transient ends faster and the difference between cycles of high stress and low stress is reduced at the end of the fifty cycles period. Some differences in the results of FEM and ANN are seen at the beginning of the phenomenon, mainly when the stress is high. Nonetheless, such difference is negligible.    Table 9 presents the mean absolute percentage error for the foregoing cases. The model has the least error when the fault is at the terminals of the power transformer. As long as the fault is farther away, the error increases. In each case, the error is greater when the stress is lower. This behavior agrees with the training characteristics of ANNs that are more influenced by higher values than by lower ones.

Simulation Time
A fault with R = 1 Ω and X L = 15 Ω was simulated with FEM and the ANN model to obtain the time of the simulation. Figure 18a presents the time spent in FEM simulations. The relationship between the number of samples and the time of simulation is linear. Each sample takes about 143 s for its simulation. For this reason, the simulation of a phenomenon with 150 pulses of stress required about six hours.
On the other hand, ANN simulations require about 0.53 s for 10,000 samples, see Figure 18b. The reduction of time due to the direct application of the model is evident. This reduction was expected since the model contains only matrix operations and the use of the activation function. The computational system limits the time reduction; therefore, the minimum simulation time is constant (0.085 s) from the tens to the hundreds of samples.
The high number of samples that can be simulated in less than one second opens the possibility of using the model in real-time applications. Considering only the peak values of stress, there are six values per cycle (two per phase); for a power frequency of 60 Hz, that means a sample each 2.78 · 10 −3 s. In real-time, 10,000 samples would require 27.8 s. Consequently, the 0.53 s are enough for a real-time simulation of this kind. The limit would be a batch of 40 samples that has an average of about 470 samples per second to be simulated, which is higher than the 360 samples per second needed for stress analysis. This analysis was performed in a personal computer with an Intel i5, 64 bit, 2.50 GHz processor and 4.00 GB of RAM.

Conclusions
This research shows the possibility of modeling the stresses in winding conductors of a 400 MVA, three-phase power transformer using artificial neural networks. For this purpose, an analysis of the combination of hyperparameters that best reduces the error between the proposed model and the finite element analysis results was performed.
For the conductor stress problem, the single-layer models performed better than the two-layer ones. Additionally, more complex models took longer to minimize the error function, which is a disadvantage during the design process. In other words, it is not always true that a more complex model is better.
As the optimizers were more elaborated or had more hyperparameters, the algorithm found the minimum of the loss function faster. RMSprop and Adam had a better convergence speed than SGD or SGDmom. As a consequence, lower learning rates could be used, and the model achieved lower losses.
As for this research's problem, the activation functions did not have too much influence on the model's accuracy. ReLU was chosen, in the end, because of its simplicity since it does not have any hyperparameter and its implementation is straightforward.
With groups of data greater than 40 samples, it is even possible to use the artificial neural network model in real-time applications when the stress is needed only in peak values, such as in fatigue analysis.
A possible future investigation could be the design of an artificial neural network model for stresses caused by inrush currents. The problem has particular characteristics that make it challenging for an artificial neural network model. Mainly, the nonlinear features of the core could be an issue in modeling the phenomenon overall because, in this case, the transformer might be working under saturated conditions.