A Novel Fault Diagnosis Method for a Power Transformer Based on Multi-Scale Approximate Entropy and Optimized Convolutional Networks

Dissolved gas analysis (DGA) in transformer oil is valuable for promptly detecting potential faults in oil-immersed transformers. Given the limitations of traditional transformer fault diagnostic methods, such as insufficient gas characteristic components and a high misjudgment rate, this study proposes a transformer fault diagnosis model based on multi-scale approximate entropy and optimized convolutional neural networks (CNNs). An improved sparrow search algorithm (ISSA) is introduced for optimizing CNN parameters, establishing the ISSA-CNN transformer fault diagnosis model. The dissolved gas components in the transformer oil are analyzed, and the multi-scale approximate entropy of the gas content under different fault modes is calculated. The computed entropy values are then used as feature parameters for the ISSA-CNN model to derive diagnostic results. Experimental data analysis demonstrates that multi-scale approximate entropy effectively characterizes the dissolved gas components in the transformer oil, significantly improving diagnostic efficiency. Comparative analysis with BPNN, ELM, and CNN models validates the effectiveness and superiority of the proposed ISSA-CNN diagnostic model across various evaluation metrics.


Introduction
Oil-immersed power transformers are vital components in power systems, primarily utilized for voltage regulation and the transmission and distribution of electrical energy [1]. These transformers use insulating oil for effective heat control. The design and operation of transformers directly affect the quality of the electrical energy and the reliability of the power system. Therefore, understanding the operational status of oil-immersed transformers and ensuring their safe and stable operation are crucial for the reliability of the power system [2].
The fault diagnosis method for oil-immersed transformers based on dissolved gas analysis (DGA) in oil has gained widespread application in recent years [3,4]. By analyzing the gas content in the transformer oil, this method can effectively identify the types of electrical faults, discover potential issues, and provide crucial information for the proactive maintenance of transformers. As a result, it has become increasingly prevalent in the field. Currently, the traditional diagnostic methods for dissolved gases in transformer oil include the three-ratio method [5] and the Duval Triangle method [6]. However, these approaches suffer from shortcomings such as insufficient coding and excessive absoluteness, leading to a higher rate of misjudgment and an inability to accurately diagnose certain faults. Therefore, various intelligent diagnostic methods for oil-immersed transformer faults based on DGA have emerged. These methods mainly include Artificial Neural Networks (ANNs), Support Vector Machines (SVMs), and Expert Systems. Data-driven fault detection methods utilize machine learning and data analysis techniques to detect equipment faults [11]. They analyze real-time sensor data or historical data, build models, and compare them with fault patterns. In recent years, these methods have been widely applied in various fields, such as industrial processes [12], HVAC systems [13], energy systems [14], potential fault identification [15], sensor analytics [16], and medical device digital systems [17]. Deep learning theory possesses robust feature learning and pattern recognition capabilities, extracting effective information from large-scale and complex data [18]. In recent years, deep learning, particularly convolutional neural networks (CNNs), has found widespread application in fault diagnosis [19]. Using convolutional and pooling layers, CNNs automatically learn local and global features from the input data and can provide effective representations of images, sequences, and so on [20]. The strength of CNNs lies in their efficient processing of complex data and their feature learning capabilities. Proper hyperparameters, such as the learning rate and filter size, are crucial to CNN model performance. The sparrow search algorithm (SSA) was proposed in 2020 as a novel swarm intelligence optimization algorithm [21]. It achieves position optimization by emulating the foraging and anti-predatory behaviors of sparrows, aiming to locate the optimum of a given problem [22]. This study introduces an improved sparrow search algorithm (ISSA) for CNN parameter optimization. ISSA can dynamically adjust these parameters to enhance model generalization and robustness. The proposed approach is applied to transformer fault diagnosis, showcasing the potential of CNNs optimized with ISSA.
The DGA method primarily utilizes the characteristic gas content for transformer fault diagnosis [23]. However, the composition of dissolved gases in oil is highly complex and uncertain, so assessing this uncertainty solely from the decomposed gas content is challenging. This study introduces information entropy [24] as a feature indicator for transformer fault diagnosis. Information entropy, a concept from information theory, measures system uncertainty and information quantity. In transformer diagnosis, information entropy can be employed to assess system states by analyzing the concentration distribution of dissolved gases. Higher entropy values indicate greater system complexity and uncertainty, potentially pointing to underlying faults. Information entropy analysis enhances the understanding of system health, supporting early fault detection and prediction [25]. Approximate entropy, one method for calculating information entropy, is commonly used for time-series data analysis [26]. It assesses system complexity and regularity, revealing patterns or trends in data. Multi-scale approximate entropy considers signal characteristics at different scales, observing how complexity evolves as the scale changes [27]. This method contributes to a comprehensive understanding of dynamic signal characteristics and provides in-depth insights into system behavior across different time scales. Approximate entropy has demonstrated effective applications in various fields, including biosignal analysis [28], short-circuiting arc welding analysis [29], mechanical vibration measurements [30], and environmental monitoring [31]. In transformer diagnosis, this paper attempts to enhance early fault prediction by calculating the multi-scale approximate entropy of dissolved gases in oil, offering a more comprehensive insight into system state changes.
This study initially collects the characteristic gas contents of oil-immersed transformers under various fault types, including H2, CH4, C2H6, C2H4, and C2H2. Subsequently, the content ratios of the different gas types are obtained. The multi-scale approximate entropy values are then calculated from these content ratios to assess the gas complexity. Finally, the multi-scale approximate entropy values serve as feature inputs for an optimized CNN-based classifier, deriving the diagnostic results. Field data demonstrate the proposed method's effectiveness and superiority in transformer fault diagnosis.
The structure of this paper is as follows. The principles of the relevant algorithms are detailed in Section 2. Section 3 presents an oil-immersed transformer fault diagnosis model based on multi-scale approximate entropy and optimized CNNs. Section 4 shows the performance of the proposed diagnostic model. Section 5 concludes the paper.

Approximate Entropy
Approximate entropy is a non-linear dynamical parameter used to quantify the regularity and unpredictability of fluctuations in a time series. It is represented by a non-negative number that reflects the complexity of the time series, indicating the likelihood that new information occurs in it. The more complex the time series, the higher the corresponding approximate entropy.
Define the distance d[x(i), x(j)] between the m-dimensional pattern vectors x(i) and x(j) to be the maximum of the absolute differences between their corresponding elements.
Given a threshold r, for each value of i, count the number of distances d that are less than r, and calculate the ratio of this count to the total number of distances N − m; denote this ratio as C_i^m(r).
Take the logarithm of C_i^m(r), and then calculate the average Φ^m(r) across all i, as in Equation (4).
In theory, the approximate entropy of this sequence is defined as

$$\mathrm{ApEn}(m, r) = \lim_{N \to \infty} \left[ \Phi^m(r) - \Phi^{m+1}(r) \right]$$

When N is a finite value, the ApEn estimate obtained by following the above steps for a sequence of length N is denoted as

$$\mathrm{ApEn}(m, r, N) = \Phi^m(r) - \Phi^{m+1}(r)$$

where Φ^m(r) denotes the average computed in Equation (4).
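The steps above can be sketched in Python. This is a minimal implementation for illustration; note that the normalization of the pattern count differs slightly across ApEn variants, and N − m + 1 template vectors are used here.

```python
import numpy as np

def apen(u, m=2, r=0.2):
    """Approximate entropy ApEn(m, r, N) of a 1-D sequence, following the
    steps above: embed, count similar patterns within tolerance r, then
    take the difference of the log-averages for dimensions m and m + 1."""
    u = np.asarray(u, dtype=float)
    N = len(u)

    def phi(m):
        # Step 1: form the m-dimensional pattern vectors x(i).
        x = np.array([u[i:i + m] for i in range(N - m + 1)])
        # Step 2: maximum absolute difference between every pair of vectors.
        d = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
        # Step 3: fraction of vectors within tolerance r of each x(i).
        C = np.mean(d <= r, axis=1)
        # Step 4: average of the logarithms across all i.
        return np.mean(np.log(C))

    return phi(m) - phi(m + 1)
```

As a sanity check, a constant series has zero approximate entropy, and a noisy series scores higher than a regular oscillation.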

Multi-Scale Approximate Entropy
Multi-scale approximate entropy (MApEn) extends the concept of approximate entropy to multiple time scales. It provides additional perspectives when dealing with data of uncertain time scales. Single-scale approximate entropy does not adequately account for the different time scales that may exist within a time series. The objective of multi-scale entropy is to assess the complexity of a time series across these scales.
The fundamental principle of multi-scale entropy involves coarsening, or downsampling, primarily analyzing the time series at progressively coarser time resolutions. Coarse-grained data take the average of different numbers of consecutive data points to create signals at different scales. The specific steps are as follows.
When Scale = 1, the coarse-grained data are the original time series. When Scale = 2, the coarse-grained time series is formed by calculating the average of two consecutive time points, as defined in Equations (7) and (8).
The mathematical definition of the above coarse-grained process is as follows:

$$y_j^{(\tau)} = \frac{1}{\tau} \sum_{i=(j-1)\tau + 1}^{j\tau} x_i, \qquad 1 \le j \le \frac{N}{\tau}$$

where τ represents the time scale.
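The coarse-graining procedure can be sketched as follows; this is a minimal illustration, where `apen_fn` stands for any approximate entropy routine.

```python
import numpy as np

def coarse_grain(x, tau):
    """Coarse-grain a series at scale tau: average consecutive,
    non-overlapping windows of length tau (Scale = 1 returns the
    original series)."""
    x = np.asarray(x, dtype=float)
    n = len(x) // tau                      # discard the incomplete tail
    return x[:n * tau].reshape(n, tau).mean(axis=1)

def multiscale_apen(x, scales, apen_fn):
    """Multi-scale approximate entropy: apply an approximate entropy
    routine (apen_fn) to the coarse-grained series at each scale."""
    return [apen_fn(coarse_grain(x, tau)) for tau in scales]
```

For example, `coarse_grain([1, 2, 3, 4, 5, 6], 2)` averages consecutive pairs, giving `[1.5, 3.5, 5.5]`.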

Improved Sparrow Search Algorithm
SSA is a heuristic optimization method inspired by the collective foraging and anti-predatory behaviors of sparrow populations. It utilizes a combination of individual exploration and information-sharing strategies to address optimization challenges. However, SSA is susceptible to the influence of problem complexity and parameter settings, which can result in slow convergence and low accuracy. This article proposes the following improvement strategies.
(1) This article employs chaotic mapping for the initialization of the SSA population to achieve stable population quality. The generated chaotic sequences are as described in Equation (10).
Here, K represents the population size, I is the current iteration count, and u takes random values between 0 and 1. The initial positions of the sparrow individuals are generated from the chaotic sequence as follows.
$$X_{K,I} = X_{K,I}^{\min} + u_{K,I}\left(X_{K,I}^{\max} - X_{K,I}^{\min}\right)$$

where $X_{K,I}^{\min}$ and $X_{K,I}^{\max}$ represent the minimum and maximum values of $X_{K,I}$, respectively.
(2) To prevent becoming stuck in local optima, this article introduces a non-linearly decreasing weight ω in the update of the SSA discoverer positions. The calculation formula is as follows.
where ω1 and ω2 are inertia adjustment parameters, with ω1 = 0.9 and ω2 = 0.4, and t_max represents the maximum number of iterations. The weight decays slowly at the beginning of the iterations, favoring a global search for the optimal solution's position.
(3) This article introduces a mutation strategy to update the contributors. A Gaussian mutation operator is introduced to perturb the global best solution, which helps prevent the search from being trapped in local optima. The Gaussian mutation operator is defined in Equation (13).
where $X_{gauss}^{t+1}$ represents the best solution after Gaussian mutation, and Gaussian(α) denotes a random vector following a Gaussian distribution with a mean of 0 and a variance of 1.
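The three strategies can be sketched as follows. Since Equations (10)-(13) are not reproduced above, the specific chaotic map and weight schedule below are illustrative assumptions, not the paper's exact formulas.

```python
import numpy as np

rng = np.random.default_rng(0)

def chaotic_init(pop_size, dim, x_min, x_max):
    """Strategy (1): chaotic initialization. A logistic map is assumed
    here as the chaotic sequence generator; the values are scaled into
    [x_min, x_max] to seed the population."""
    u = rng.uniform(0.01, 0.99, size=(pop_size, dim))
    for _ in range(10):                     # iterate the map to de-correlate
        u = 4.0 * u * (1.0 - u)
    return x_min + u * (x_max - x_min)

def nonlinear_weight(t, t_max, w1=0.9, w2=0.4):
    """Strategy (2): a weight decaying non-linearly from w1 to w2. The
    exact schedule is not reproduced in the text; this quadratic form is
    an illustrative assumption (slow decay early, faster later)."""
    return w2 + (w1 - w2) * (1.0 - (t / t_max) ** 2)

def gaussian_mutation(x_best):
    """Strategy (3): perturb the global best with a standard-normal
    random vector to help escape local optima."""
    return x_best * (1.0 + rng.standard_normal(x_best.shape))
```

The weight starts at ω1 = 0.9 and ends at ω2 = 0.4 over the iteration budget, matching the parameter values stated above.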
A flowchart of the ISSA is shown in Figure 2. To validate the performance of ISSA, this study conducted numerical simulation experiments on six functions selected from the Congress on Evolutionary Computation test suite [32]. A comparative analysis was carried out with Particle Swarm Optimization (PSO) [33], Grey Wolf Optimizer (GWO) [34], Gravitational Search Algorithm (GSA) [35], and the African Vultures Optimization Algorithm (AVOA) [36]. The six functions and their parameters are listed in Table 2. The convergence curves of the different algorithms are illustrated in Figure 3.

From Figure 3, it can be observed that ISSA exhibits the fastest convergence speed on the various test functions, demonstrating significantly better performance than PSO, GSA, GWO, and AVOA.

CNNs
CNNs are a type of deep feedforward neural network with a hierarchical structure. The architecture primarily includes convolutional layers, pooling layers, activation layers, and fully connected layers.

Convolutional Layer
The main role of the convolutional layer in CNNs is to perform feature extraction on the input. The convolutional kernels in different layers have varying sizes, allowing the network to capture features at different scales. As a result, CNNs can extract multi-scale feature information. The calculation formula for the output value $a_j^l$ of the jth unit in convolutional layer l is as follows.

$$a_j^l = f\left(\sum_{i \in M_j^l} a_i^{l-1} * k_{ij}^l + b_j^l\right)$$

where $M_j^l$ represents the selected set of input feature maps, $k$ represents the learnable convolutional kernel, $b_j^l$ is the bias, and $f(\cdot)$ is the activation function.

Pooling Layer
Pooling operations are performed independently on each subset of the data. The purpose of pooling is to gradually reduce the spatial dimensions of the data volume. This helps reduce the number of parameters in the network, effectively saving computational resources. Pooling is commonly used during upsampling and downsampling processes and has no learnable parameters. The activation value in pooling layer l is calculated using Equation (15).
where down(·) represents the pooling function, $b_j^l$ is the bias, $\beta_j^l$ denotes the multiplicative residual, and $M^l$ represents the size of the pooling window.

For an input p, the ReLU function returns an output equal to the maximum value between p and 0. If p is greater than or equal to 0, the output is p itself; otherwise, the output is 0.

Fully Connected Layer
The parameters in the fully connected layer include the total number of fully connected layers and the number of neurons in each layer. Increasing the width of the fully connected layers and the number of layers can enhance the model's non-linear expressive power.
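The layer operations described above can be sketched in a minimal, framework-free form. These are 1-D versions for illustration only; real CNNs apply the same ideas over 2-D feature maps with learned parameters.

```python
import numpy as np

def conv1d(x, k, b=0.0):
    """Valid 1-D convolution (implemented as correlation, the usual
    deep-learning convention) of input x with kernel k plus a bias:
    the feature-extraction step of the convolutional layer."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)]) + b

def relu(p):
    """Activation: elementwise max(p, 0), as described above."""
    return np.maximum(p, 0.0)

def max_pool(x, w):
    """Non-overlapping max pooling with window w; it has no learnable
    parameters and only shrinks the spatial dimension."""
    n = len(x) // w
    return x[:n * w].reshape(n, w).max(axis=1)

def dense(x, W, b):
    """Fully connected layer: the affine map W @ x + b."""
    return W @ x + b
```

Chaining `dense(relu(max_pool(conv1d(...), ...)), ...)` reproduces the conv → pool → activation → fully-connected flow of the architecture.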

Optimized CNNs with ISSA
The basic process of CNNs based on ISSA is illustrated in Figure 4.

Power Transformer Fault Diagnosis Based on Multi-Scale Approximate Entropy and Optimized Deep Convolutional Networks
This study utilizes the optimized convolutional neural network for the analysis of dissolved gases in transformer oil. Initially, dissolved gas data covering the eight transformer states are collected. Subsequently, the gas contents are numerically labeled and normalized. Multi-scale approximate entropy is employed for feature extraction on the pre-processed data. Finally, the extracted features are fed into the optimized convolutional neural network for fault diagnosis. The diagnostic process is illustrated in Figure 5.
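The diagnostic flow of Figure 5 can be sketched end to end. Every function below is a hypothetical stand-in: in particular, `feature_extract` substitutes simple summary statistics for the multi-scale approximate entropy stage, and a nearest-centroid rule stands in for the ISSA-CNN classifier, so the pipeline shape, not the exact method, is what is illustrated.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pipeline stages: gas data -> normalize -> features -> classifier.
def normalize(X):
    # column-wise normalization of the raw gas measurements
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)

def feature_extract(X):
    # placeholder for the multi-scale approximate entropy stage:
    # per-sample summary statistics keep the sketch runnable end to end
    return np.column_stack([X.mean(axis=1), X.std(axis=1)])

def nearest_centroid_fit(F, y):
    # one centroid per fault class in feature space
    return {c: F[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(model, F):
    classes = list(model)
    d = np.array([[np.linalg.norm(f - model[c]) for c in classes] for f in F])
    return np.array([classes[i] for i in d.argmin(axis=1)])

# toy run: two synthetic "fault modes" with different spread
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(0, 4, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
F = feature_extract(normalize(X))
model = nearest_centroid_fit(F, y)
pred = nearest_centroid_predict(model, F)
```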


Data Preprocessing
The raw data used in this study consist of actual measurements of dissolved gases in transformer oil from a substation, totaling 555 sets. Some of the transformer parameters are presented in Table 3.

Each set of data includes five features along with the corresponding eight data types: normal type (NT), high-energy discharge (HD), low-energy discharge (LD), high-temperature overheating (HO), intermediate-temperature overheating (ITO), intermediate-to-low-temperature overheating (ILO), low-temperature overheating (LO), and partial discharge (PD). Some of the gas chromatography data are presented in Table 4. Due to significant differences in the gas content values corresponding to different fault types, this study standardizes the gas contents using Equation (17); the processed data are presented in Table 5. Because transformer fault types correlate with the corresponding gas content ratios, gas ratios are commonly used as input data in transformer fault diagnosis. In this work, 21 gas ratios were obtained, as shown in Table 6.
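The preprocessing steps can be sketched as follows. Equation (17) is not reproduced above, so a z-score form of standardization is assumed, and the ratio set is illustrative (all directed pairwise ratios of the five gases, which yields 20), not necessarily the paper's exact set of 21 ratios.

```python
import numpy as np

def standardize(X):
    """Column-wise standardization of the gas contents (a z-score form
    is assumed for Equation (17)): zero mean, unit standard deviation
    per gas."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

def gas_ratios(sample, eps=1e-12):
    """All directed pairwise ratios gas_i / gas_j (i != j) of the five
    feature gases; eps guards against division by zero for absent
    gases. The paper's exact 21-ratio set is not specified here."""
    g = np.asarray(sample, dtype=float)
    return np.array([g[i] / (g[j] + eps)
                     for i in range(len(g)) for j in range(len(g)) if i != j])
```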

Feature Extraction
To extract valuable feature information from the aforementioned gas ratios, multi-scale approximate entropy is introduced to extract characteristic parameters from the gas ratio data. With the approximate entropy scale set to 10, the obtained approximate entropy values for the different types of gas contents are illustrated in Figure 6.

Figure 6 shows that when the scale is greater than 6, the approximate entropy values for the different transformer faults are relatively similar, exhibiting the same changing trend. At scale values of 4, 5, and 6, some fault types still have similar approximate entropy values. However, at a scale value of 3, the differences in approximate entropy values among the different faults become more distinct. Considering that a low scale may lead to the loss of sample information, the scale is set to 3 in this study.
Taking into account the impact of different embedding dimensions on entropy values, the embedding dimensions range from 2 to 6. Figure 7 presents the comparative results for different fault types under different embedding dimensions.
The impact of the embedding dimension for different fault types is evident from Figure 7. When m takes values of 2, 3, 4, and 6, the approximate entropy values fluctuate drastically, leading to potential confusion between different fault types. However, when m is set to 5, the approximate entropy values for the different fault types change more gradually with increasing scale. Therefore, m is set to 5. Partial results are presented in Table 8.

Optimized CNNs with ISSA
This section utilizes the extracted data to train a CNN model. The ISSA optimization method is employed to fine-tune the CNN hyperparameters, with the maximum number of training iterations set to 10 and the discoverers' proportion in the population set to 20%. The parameters are presented in Table 9. From Figure 8, it is evident that, compared to PSO-CNN and SSA-CNN, ISSA-CNN converges more rapidly to a stable fitness value, indicating its superior optimization effectiveness.
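The hyperparameter tuning loop can be sketched as follows. The `fitness` function below is a hypothetical surrogate for training the CNN and returning its validation error, and the simplified population update (resampling around the current best) stands in for the full ISSA procedure of Figure 4.

```python
import numpy as np

rng = np.random.default_rng(1)

def fitness(lr, filters):
    """Hypothetical surrogate for 'train the CNN with these
    hyperparameters and return the validation error'; real use would
    run a training job and report 1 - accuracy."""
    return (np.log10(lr) + 2.5) ** 2 + (filters - 24) ** 2 / 100.0

def tune(pop=10, iters=10):
    """Simplified population-based search over (learning rate, filter
    count): evaluate the population, record the best candidate, then
    resample the population around it (exploitation)."""
    lrs = 10 ** rng.uniform(-4, -1, pop)           # log-uniform learning rate
    filt = rng.integers(8, 64, pop).astype(float)  # filter count
    best = None
    for t in range(iters):
        scores = [fitness(l, f) for l, f in zip(lrs, filt)]
        i = int(np.argmin(scores))
        if best is None or scores[i] < best[0]:
            best = (scores[i], lrs[i], filt[i])
        # contract the population toward the best candidate found so far
        lrs = np.clip(best[1] * 10 ** rng.normal(0, 0.2, pop), 1e-4, 1e-1)
        filt = np.clip(best[2] + rng.normal(0, 4, pop), 8, 64)
    return best
```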

Results Analysis
This study analyzes 555 sets of transformer data, comparing the situations before and after feature extraction. Five-fold cross-validation is employed, where the sample data are randomly divided into five equal parts, namely D1, D2, D3, D4, and D5. Each part is used as the test set in turn, while the remaining four parts serve as the training set. The testing results are illustrated in Figure 9, where raw data denote the original 21 gas ratios and MApEn denotes the multi-scale approximate entropy values. The memory consumption before and after feature extraction is presented in Figure 10, and the memory consumption of the model before and after optimization with ISSA is presented in Figure 11.
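The five-fold partitioning described above can be sketched as:

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Split n sample indices into five roughly equal random parts
    (D1..D5); each part serves once as the test set while the other
    four parts form the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test
```

With n = 555, every sample appears in exactly one test fold across the five splits.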
Figure 9 indicates that, after utilizing multi-scale approximate entropy for feature extraction on the transformer data, the diagnostic results for the different partitions outperform those obtained before feature extraction. This shows that the feature extraction method in this study captures valuable transformer data information and eliminates easily confused redundant information.

Figure 10 shows that the memory consumption of the model in processing data is much lower after feature extraction than before. This indicates that the feature extraction method in this study significantly improves the efficiency of diagnostic operations.

Figure 11 indicates that the memory consumption of the model during fault diagnosis is much lower after ISSA optimization, meaning that the ISSA method significantly improves the efficiency of diagnostic operations.

To thoroughly validate the superiority of the proposed transformer fault diagnostic model, three algorithms, BPNN, ELM, and CNN, are introduced for comparative analysis. The confusion matrices obtained through five-fold cross-validation are presented in Figure 12.
In Figure 12, it is evident that the different diagnostic methods yield significantly different results. As seen in Figure 12a, the diagnostic performance of the BPNN method is relatively poor: although it accurately identifies data with ITO, it struggles to recognize other types of transformer faults. In Figure 12b, the ELM method improves the diagnostic accuracy but still exhibits noticeable misjudgments, making it difficult to differentiate between ITO and ILO. The results in Figure 12c indicate that the CNN, compared to the first two algorithms, achieves an overall improvement in recognition accuracy; however, there are still clear misjudgments in identifying fault labels. In Figure 12d, it can be observed that the CNN classification model optimized through ISSA demonstrates excellent recognition performance, meeting the engineering requirements.

To provide a comprehensive assessment of the proposed model's performance, this paper employs accuracy, precision, recall, F1-score, and the Kappa coefficient. Accuracy is a fundamental metric for evaluating a classification model, measuring the ratio of correctly classified samples to the total number of samples. Precision represents the proportion of true-positive samples among those predicted as positive. Recall indicates the ratio of correctly predicted positive samples to all actual positive samples. F1-score combines precision and recall as their harmonic mean. The Kappa coefficient is a statistical measure of classification performance that accounts for the difference between the model's performance and random classification; its value ranges from −1 to 1, where 1 signifies perfect agreement, 0 indicates no difference from random classification, and −1 denotes complete disagreement. The calculation methods are shown in Equations (18)-(23).
where TP represents true positives, TN true negatives, FP false positives, and FN false negatives. P(E) is the expected accuracy under random classification, calculated in Equation (23).
P(E) = [(TP + FP)(TP + FN) + (TN + FN)(TN + FP)] / (TP + TN + FP + FN)²
The diagnostic results for different methods are illustrated in Figure 13. The box in the figure represents the interquartile range from the upper quartile to the lower quartile. The upper and lower whiskers depict the maximum and minimum values, respectively. The median point represents the middle value, indicating the average level of the metrics calculated by each method.
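As a minimal illustration of Equations (18)-(23) for a binary confusion matrix (the multi-class case typically averages per-class values), the five metrics can be computed as follows; this is a sketch, not the authors' code:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1-score, and Kappa from binary confusion counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n                    # correctly classified / total samples
    precision = tp / (tp + fp)                  # true positives among predicted positives
    recall = tp / (tp + fn)                     # true positives among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    # Expected accuracy under random classification, normalized by n^2
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (accuracy - p_e) / (1 - p_e)        # agreement beyond chance
    return accuracy, precision, recall, f1, kappa
```

For example, counts (TP, TN, FP, FN) = (40, 45, 5, 10) give an accuracy of 0.85 and a Kappa coefficient of 0.70.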
Figure 13a reveals significant differences in fault accuracy among the four diagnostic methods, and the ISSA-CNN method exhibits a distinct advantage compared to the other three methods. The diagnostic results in Figure 13b indicate that ISSA-CNN achieves higher precision in terms of maximum, minimum, and average values, demonstrating superior recognition performance within the limited transformer data range. The results in Figure 13c suggest that the recall rate of the proposed method is higher, indicating a greater number of correctly predicted samples and a clear advantage in diagnostic effectiveness. As shown in Figure 13d, the F1-scores obtained by ISSA-CNN are distributed above 85%, indicating excellent generalization performance. Figure 13e demonstrates that the Kappa coefficient of ISSA-CNN has the highest minimum value, with overlapping boxes indicating a stable distribution range and high classification accuracy.

Conclusions
Building upon the analysis of dissolved gases in transformer oil, this study proposes the ISSA-CNN model for transformer fault diagnosis. The conclusions are as follows.

1. This study introduces an improved sparrow search algorithm that incorporates enhancement strategies in population initialization and position updating. The effectiveness of the enhanced algorithm is validated by optimizing test functions. The algorithm is then applied to optimize the hyperparameters of CNNs. Comparative analysis with different optimization algorithms and validation on the DGA dataset demonstrate its superiority.

2. This study analyzes eight different types of transformer oil and gas data, deriving 21 gas ratios. Multi-scale approximate entropy is then calculated for these gas ratio contents. The uncertainty of dissolved gases in transformer oil is represented by the entropy values, and the multi-scale approximate entropy values are used as feature vectors input into the optimized CNN diagnostic model. The results indicate that the extracted multi-scale approximate entropy can effectively characterize dissolved gas contents and improve the diagnostic effectiveness.

3. To verify the effectiveness and superiority of the proposed method, this study compares it with BPNN, ELM, and CNNs. The results show that the ISSA-CNN transformer fault diagnosis model outperforms the other three methods in terms of accuracy, recall, precision, F1-score, and Kappa coefficient. This indicates that the proposed method has good generalization performance and demonstrates favorable application effects in transformer fault diagnosis.
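For illustration, the multi-scale approximate entropy feature extraction described above can be sketched in numpy as non-overlapping coarse-graining followed by Pincus's approximate entropy. This is a generic sketch, not the authors' implementation; the embedding dimension m = 2 and tolerance r = 0.2σ are conventional defaults and not necessarily the values used in this study.

```python
import numpy as np

def approx_entropy(x, m=2, r=None):
    """Approximate entropy ApEn(m, r) of a 1-D series (Pincus's formulation)."""
    x = np.asarray(x, dtype=float)
    if r is None:
        r = 0.2 * x.std()  # conventional tolerance: 20% of the series std
    def phi(m):
        n = len(x) - m + 1
        templ = np.array([x[i:i + m] for i in range(n)])  # all length-m templates
        # Chebyshev distance between every pair of templates
        d = np.max(np.abs(templ[:, None, :] - templ[None, :, :]), axis=2)
        c = (d <= r).mean(axis=1)  # fraction of templates within tolerance
        return np.log(c).mean()
    return phi(m) - phi(m + 1)

def multiscale_apen(x, scales=(1, 2, 3, 4, 5), m=2):
    """ApEn of the coarse-grained series at each scale factor tau."""
    x = np.asarray(x, dtype=float)
    out = []
    for tau in scales:
        n = len(x) // tau
        coarse = x[:n * tau].reshape(n, tau).mean(axis=1)  # non-overlapping means
        out.append(approx_entropy(coarse, m=m, r=0.2 * x.std()))
    return np.array(out)
```

The resulting per-scale entropy vector is what would be fed to the diagnostic model as a feature vector; a regular signal (e.g., a sinusoid) yields markedly lower ApEn than white noise.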
In the future, the authors will attempt to collect more on-site transformer fault data to validate the effectiveness and practicality of the proposed model. Additionally, further improvements can be made to better optimize the parameters of the convolutional neural network and to enhance the robustness and stability of the model.

where b is the bias, β_j^l denotes the multiplicative residual, and M_l represents the size of the pooling window.

2.3.3. Activation Layer
CNNs are composed of multiple layers of composite functions. The Rectified Linear Unit (ReLU) is a widely used activation function in CNNs; its sparse representation can accelerate learning and simplify models. The mathematical expression of the ReLU function is
f(p) = max(0, p)
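As a one-line illustration of the ReLU function f(p) = max(0, p), applied element-wise as in a CNN activation layer (a generic sketch, not tied to the paper's network):

```python
import numpy as np

def relu(p):
    """Rectified Linear Unit: zero out negative inputs, pass positives through."""
    return np.maximum(0.0, p)

relu(np.array([-2.0, 0.0, 3.5]))  # negatives become 0.0; 0.0 and 3.5 are unchanged
```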

Figure 4. The process of CNNs based on ISSA.

Figure 6. Approximate entropy values varying with different scales.

Figure 8 depicts the fitness curves obtained through testing with PSO-CNN, SSA-CNN, and ISSA-CNN, respectively.

Figure 8. Fitness curve of different optimization methods.

Figure 10. The memory consumption before and after feature extraction.

Figure 11. The memory consumption before and after optimization via ISSA.

Figure 9 indicates that after utilizing multi-scale approximate entropy for feature extraction in the transformer data, the diagnostic results for the different partitions show better performance than before feature extraction. This indicates that the feature extraction method in this study can collect valuable transformer data information and eliminate easily confused redundant information. Figure 10 shows that the memory consumption of the model in processing data is much lower after feature extraction than before, meaning that the feature extraction method significantly improves the efficiency of diagnostic operations. Figure 11 indicates that the memory consumption of the model during fault diagnosis is much lower after ISSA optimization, meaning that the ISSA method likewise improves diagnostic efficiency.

Table 1. Comparison of different diagnostic methods.

Table 8. Partial multi-scale entropy value results.