Article

Defect Identification and Diagnosis for Distribution Network Electrical Equipment Based on Fused Image and Voiceprint Joint Perception

1 China Southern Power Grid Guangdong Zhongshan Power Supply Bureau, Zhongshan 528400, China
2 School of Electrical and Electronic Engineering, North China Electric Power University, Beijing 102206, China
* Author to whom correspondence should be addressed.
Energies 2025, 18(13), 3451; https://doi.org/10.3390/en18133451
Submission received: 30 April 2025 / Revised: 26 May 2025 / Accepted: 13 June 2025 / Published: 30 June 2025

Abstract

As the scale of distribution networks expands, existing defect identification methods face numerous challenges, including the limitations of single-modal feature identification, insufficient cross-modal information fusion, and the lack of a multi-stage feedback mechanism. To address these issues, we first propose a joint perception method for image and voiceprint features based on bidirectional coupled attention, which enables deep interaction across modalities and overcomes the shortcomings of traditional methods in cross-modal fusion. Second, a defect identification and diagnosis method for distribution network electrical equipment based on a two-stage convolutional neural network (CNN) is introduced, which directs the network's attention to typical and frequent defects and enhances diagnostic accuracy and robustness. The proposed algorithm is compared with two baseline algorithms: Baseline 1 is a long short-term memory (LSTM)-based algorithm that extracts and processes image and voiceprint features separately, without coupling the features of the two modalities, and Baseline 2 is a traditional CNN algorithm that uses classical convolutional layers for feature learning and performs classification through pooling and fully connected layers. Simulation results demonstrate that, compared with the two baselines, the proposed method improves accuracy by 12.1% and 33.7%, recall by 12.5% and 33.1%, and diagnostic efficiency by 22.92% and 60.42%, respectively.

1. Introduction

Accurate and efficient defect identification and diagnosis of electrical equipment in distribution networks is the foundation for handling distribution network accidents, rapidly restoring power, and maintaining reliable operation [1]. With the advancement of modern power systems, the increasing deployment of power electronic devices, and the growing proportion of high-voltage direct current (DC) systems and renewable energy, the scale and structure of distribution networks become increasingly complex. Moreover, factors such as adverse weather and pollution have increased the risk of faults, leading to frequent power outages and equipment damage, which severely impacts the stability of the power system [2,3,4]. Certain critical faults, such as high impedance faults (HIFs), are difficult to detect because they generate low fault currents and their arc characteristics resemble those of normal operations (such as load switching). These factors demand higher standards for defect identification and diagnosis of electrical equipment in distribution networks. By monitoring equipment images (e.g., ultraviolet, infrared, and visible light images, which reflect external conditions such as cracks and corrosion [5]) or voiceprints (which reflect abnormal noise during operation [6]), the working state of equipment can be identified, allowing potential defects to be detected and resolved in time and faults to be effectively avoided. However, due to the complex and dynamic environment of distribution networks, including strong electromagnetic interference, equipment vibration, and environmental noise, single-modal defect identification technologies face many challenges. For example, image feature identification has a high misjudgment rate under low light or strong light reflection, while voiceprint identification suffers from decreased accuracy under complex noise interference. Additionally, different equipment states and fault modes depend on single-modal features to different extents. For instance, in a humid environment, partial discharge activity increases, and voiceprint analysis of partial discharge signals is more effective; in contrast, defects such as surface cracks or corrosion rely more on image feature identification. Therefore, how to integrate image and voiceprint features under complex conditions, fully utilize their complementarity, and establish a full-stage identification and diagnosis mechanism for comprehensive and precise monitoring of distribution network equipment defects remains a key technical challenge in the field of electrical equipment defect identification in distribution networks.
By extracting core information from electrical equipment images and voiceprint features, the accuracy and robustness of defect identification can be improved, and researchers have conducted extensive studies on this topic. A new method for power equipment fault diagnosis based on infrared thermal imaging was proposed in [7], where the you only look once version 5 (YOLOv5) convolutional neural network is adopted and enhanced to improve the efficiency and accuracy of defect diagnosis. In [8], Kong et al. developed a logistic membership function-based algorithm for infrared image enhancement, which reduces thermal image noise and improves contrast through blur processing, enabling efficient batch processing and power equipment fault diagnosis. A data augmentation method based on a cycle-consistent generative adversarial network (CycleGAN) for visible light image features was proposed in [9], which converts easily obtainable visible light images into difficult-to-obtain infrared images, expanding the pool of power equipment infrared images. A transformer fault voiceprint identification model based on Mel-frequency cepstral coefficients (MFCC) and deep learning was proposed in [10]; this model greatly enhanced the accuracy of fault identification for transformer acoustic signals in low signal-to-noise ratio environments. In [11], Satea et al. introduced a method for detecting high impedance faults that combines third harmonic current/voltage amplitude variations with wavelet analysis, enhancing detection reliability and significantly reducing the risk of mistaking normal events for faults. Although the above studies have made progress, they mostly focus on single-modal data and fail to exploit the fusion of image and voiceprint features. This limitation impedes accurate capture of the physical state of electrical equipment, degrading fault identification performance. Meanwhile, feature fusion based on attention mechanisms has shown excellent potential in fault diagnostics and the analysis of multi-source heterogeneous data, and has become an important development direction for defect identification and diagnosis. A fault identification and diagnosis method for unmanned aerial vehicles (UAVs) was proposed in [12]; through intelligent analysis and real-time monitoring of flight images and voiceprint data, efficient fault diagnosis of UAV flight status is achieved. However, these methods overlook horizontal-vertical attention coupling, which prevents the effective capture of image-voiceprint interactive features and leads to slow convergence and low accuracy. Consequently, distribution network equipment defects still lack rapid, precise detection and diagnosis.
Artificial intelligence (AI) is increasingly being applied across various fields, providing creative solutions to tackle complex problems. For example, in the healthcare industry [13], Kh-Madhloom et al. introduced a multi-layer encryption system combining deoxyribonucleic acid (DNA) computing and advanced encryption standard (AES) algorithms to improve the security of patient healthcare images, showcasing the potential of AI in safeguarding sensitive medical data. In video compression [14], Hassen et al. developed an optimal video frame compression model based on a quantum genetic algorithm, utilizing quantum machine learning techniques to optimize video multicast efficiency over the internet.
Similarly, in the field of defect identification and diagnosis for electrical equipment in distribution networks, the application of AI technology, particularly the convolutional neural network (CNN), is crucial. CNN-based techniques that fuse image and voiceprint features enable automatic defect identification and diagnosis in electrical equipment. By leveraging the powerful capabilities of deep learning in image and audio data analysis, this approach significantly enhances the efficiency and safety of equipment maintenance, reduces operational risks, and helps lower maintenance costs. In [15], Qiu et al. fused CNN-extracted image features via adaptive weights for equipment state assessment, effectively reducing maintenance costs. In [16], Xing et al. combined 1D-CNN-processed gas data and residual-compressed infrared imaging with cross-modal attention, achieving precise transformer health evaluation. In [17], MFCC combined with a CNN was used to effectively diagnose abnormal states in transformers, achieving an accuracy of 98% under low-noise conditions. A multi-source data fusion algorithm, integrating multi-source feature extraction and a multi-scale CNN fusion network, was introduced in [18] to simultaneously process image and voiceprint data, ensuring precise equipment failure monitoring. However, existing methods typically employ single-stage CNN networks, lack multi-stage adaptive adjustment mechanisms, and cannot dynamically update attention weights based on temporal features, which limits the model's focus on typical and frequent defects. Additionally, existing methods do not incorporate closed-loop feedback mechanisms, making precise diagnosis difficult in complex environments and leading to insufficient diagnostic accuracy and robustness.
Despite progress in existing research, defect identification and diagnosis of electrical equipment in distribution networks still face the following challenges. Firstly, traditional defect identification and diagnosis methods rely only on single-modal data, failing to fully extract the rich information contained in different modalities and neglecting cross-modal interactions and correlations. This leads to low identification and diagnosis accuracy and increased risks of misdetection or missed detection. Secondly, existing defect identification and diagnosis methods based on image and voiceprint fusion do not consider the bidirectional coupling of horizontal and vertical attention. This results in insufficiently fine control of modality interactions, limiting the effective transmission of cross-modal information and further affecting identification and diagnosis efficiency. Lastly, current defect identification and diagnosis methods often use a single CNN network, lacking adaptive adjustment and closed-loop feedback mechanisms across multi-stage networks and neglecting temporal features. This leads to low identification accuracy, poor stability, and insufficient intelligent early warning capability, making such methods unsuitable for electrical equipment defect identification scenarios involving fused image and voiceprint data.
In response to these challenges, we first propose a joint perception method for image and voiceprint features based on bidirectional coupled attention to improve the efficiency of defect identification and diagnosis. Second, a two-stage CNN-based method for identifying and diagnosing defects in distribution network electrical equipment is proposed to improve the accuracy and robustness of diagnosis.
The main contributions of this paper are as follows:
  • Joint Perception of Image and Voiceprint Features based on Bidirectional Coupled Attention: Firstly, an initial feature extraction of image and voiceprint data is performed using an LSTM network. Then, the features from the two modalities are concatenated and coupled through affine transformation. Subsequently, bidirectional coupled attention is constructed for both image and voiceprint modalities, including vertical self-attention and horizontal coupled attention. Vertical self-attention uses a multi-head attention mechanism to adapt to more complex mapping relationships. Horizontal coupled attention integrates information from other modalities when computing attention scores, enabling deep interaction across modalities. This prevents the inefficiency of relying solely on single-modal information and bridges the semantic gap between modalities, improving the efficiency of defect identification and diagnosis.
  • Defect Identification and Diagnosis of Distribution Network Electrical Equipment based on Two-stage CNN: Firstly, a preliminary defect identification method based on a temporal-adaptive CNN is proposed, which analyzes electrical equipment feature data and adapts the attention weights based on temporal features, such as the time intervals and correlations of the data. This improves the network’s focus on typical and frequently occurring defects, enhancing the adaptability of the preliminary identification stage. Secondly, a precise diagnostic method based on closed-loop CNN is introduced. This method adjusts the diagnostic network using closed-loop feedback based on actual defect conditions and preliminary identification results, achieving refined diagnosis in complex environments, thus significantly improving diagnostic accuracy and robustness.

2. Joint Perception of Image and Voiceprint Features Based on Bidirectional Coupled Attention

A joint perception method for image and voiceprint features based on bidirectional coupled attention is proposed, as shown in Figure 1. Building on traditional feature extraction research, waveform and temporal features of the image and voiceprint information are obtained. Affine transformation is used to combine the two types of features of each modality, yielding feature vectors that contain both waveform and temporal information. To avoid conflicts between the two input streams and to guarantee convergence when they are fed into the interaction stage, the method combines affine transformation, concatenation, and multi-head attention mechanisms. In the subsequent defect identification and diagnosis of electrical equipment, the temporal information is used to improve model performance.

2.1. Extraction and Construction of Image and Voiceprint Features

Firstly, based on an LSTM network, initial features of defects in distribution network electrical equipment contained in image and voiceprint signals are extracted. Assume the time slot set is $T = \{1, \ldots, t, \ldots, T\}$, where $t$ is the slot index and $T$ is the total number of slots. The process of extracting image signal features can be expressed as
$$\begin{aligned}
V_I^t &= \mathrm{sigmoid}\left(\eta_{V_I}\left[S_I^{t-1}, IN_I^t\right] + \lambda_{V_I}\right) \\
J_I^t &= \mathrm{sigmoid}\left(\eta_{J_I}\left[S_I^{t-1}, IN_I^t\right] + \lambda_{J_I}\right) \\
O_I^t &= \mathrm{sigmoid}\left(\eta_{O_I}\left[S_I^{t-1}, IN_I^t\right] + \lambda_{O_I}\right) \\
Q_I^t &= V_I^t \circ Q_I^{t-1} + J_I^t \circ \tanh\left(\eta_{Q_I}\left[S_I^{t-1}, IN_I^t\right] + \lambda_{Q_I}\right) \\
S_I^t &= O_I^t \circ \tanh\left(Q_I^t\right)
\end{aligned}$$

where $V_I^t$, $J_I^t$, and $O_I^t$ represent the state information of the forget gate, input gate, and output gate of the image-signal feature extraction LSTM network, respectively; $\eta_{V_I}$, $\eta_{J_I}$, $\eta_{O_I}$, and $\eta_{Q_I}$ are the weight parameters of the forget gate, input gate, output gate, and candidate cell, respectively; $\lambda_{V_I}$, $\lambda_{J_I}$, $\lambda_{O_I}$, and $\lambda_{Q_I}$ are the corresponding biases; $IN_I^t$ denotes the input to the network, i.e., the image signal; $S_I^{t-1}$ is the hidden-layer output in slot $t-1$; $Q_I^t$ is the candidate cell state in time slot $t$; $\circ$ denotes element-wise multiplication; and $\mathrm{sigmoid}$ and $\tanh$ are the activation functions used to update the gate states and to create the new candidate cell state.
The process of extracting voiceprint signal features can be expressed as
$$\begin{aligned}
V_A^t &= \mathrm{sigmoid}\left(\eta_{V_A}\left[S_A^{t-1}, IN_A^t\right] + \lambda_{V_A}\right) \\
J_A^t &= \mathrm{sigmoid}\left(\eta_{J_A}\left[S_A^{t-1}, IN_A^t\right] + \lambda_{J_A}\right) \\
O_A^t &= \mathrm{sigmoid}\left(\eta_{O_A}\left[S_A^{t-1}, IN_A^t\right] + \lambda_{O_A}\right) \\
Q_A^t &= V_A^t \circ Q_A^{t-1} + J_A^t \circ \tanh\left(\eta_{Q_A}\left[S_A^{t-1}, IN_A^t\right] + \lambda_{Q_A}\right) \\
S_A^t &= O_A^t \circ \tanh\left(Q_A^t\right)
\end{aligned}$$

where $V_A^t$, $J_A^t$, and $O_A^t$ represent the state information of the forget gate, input gate, and output gate of the voiceprint-signal feature extraction LSTM network, respectively; $\eta_{V_A}$, $\eta_{J_A}$, $\eta_{O_A}$, and $\eta_{Q_A}$ are the corresponding weight parameters of the forget gate, input gate, output gate, and candidate cell; $\lambda_{V_A}$, $\lambda_{J_A}$, $\lambda_{O_A}$, and $\lambda_{Q_A}$ are the corresponding biases; $IN_A^t$ is the input to the network, i.e., the voiceprint signal; $S_A^{t-1}$ denotes the hidden-layer output in slot $t-1$; and $Q_A^{t-1}$ is the candidate cell state in time slot $t-1$.
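To make the gate equations concrete, the following numpy sketch steps one such LSTM over a toy input sequence. It is a minimal illustration, assuming small random weights and toy dimensions; `lstm_step` and the parameter names are ours for demonstration, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(params, s_prev, q_prev, x_t):
    """One LSTM step following the gate equations above.

    params holds (eta, lam) weight/bias pairs for the forget (V),
    input (J), output (O) gates and the candidate cell (Q)."""
    z = np.concatenate([s_prev, x_t])                    # [S^{t-1}, IN^t]
    v = sigmoid(params["eta_V"] @ z + params["lam_V"])   # forget gate V^t
    j = sigmoid(params["eta_J"] @ z + params["lam_J"])   # input gate J^t
    o = sigmoid(params["eta_O"] @ z + params["lam_O"])   # output gate O^t
    q_cand = np.tanh(params["eta_Q"] @ z + params["lam_Q"])
    q = v * q_prev + j * q_cand                          # cell state Q^t
    s = o * np.tanh(q)                                   # hidden output S^t
    return s, q

# toy dimensions: 8-dim hidden state, 4-dim input signal
rng = np.random.default_rng(0)
d_h, d_in = 8, 4
params = {f"eta_{g}": rng.normal(size=(d_h, d_h + d_in)) * 0.1 for g in "VJOQ"}
params.update({f"lam_{g}": np.zeros(d_h) for g in "VJOQ"})
s, q = np.zeros(d_h), np.zeros(d_h)
for _ in range(5):                                       # run over 5 time slots
    s, q = lstm_step(params, s, q, rng.normal(size=d_in))
print(s.shape)  # (8,)
```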

2.2. Construction of Bidirectional Coupled Attention

Consider that the initial defect features of the distribution network electrical equipment extracted by the image-signal feature extraction LSTM network include the waveform feature $N_I^t$ and the temporal feature $\tau_I^t$, and that those extracted by the voiceprint-signal feature extraction LSTM network include the waveform feature $N_A^t$ and the temporal feature $\tau_A^t$. These waveform and temporal features from the two modalities are then concatenated and coupled through affine transformation.
First, the temporal feature vectors of the two modalities are concatenated into a matrix, which is then multiplied by a weight vector $\theta$ and fused into a multimodal offset $b^t$, expressed as

$$b^t = \left[\tau_I^t, \tau_A^t\right] \theta$$
Then, $b^t$ provides a unified offset for the affine transformation of each modality, and the modal representations containing temporal information are obtained, expressed as
$$Q_I^t = W_I^Q N_I^t + b^t, \qquad K_I^t = W_I^K N_I^t + b^t, \qquad V_I^t = W_I^V N_I^t + b^t$$

$$Q_A^t = W_A^Q N_A^t + b^t, \qquad K_A^t = W_A^K N_A^t + b^t, \qquad V_A^t = W_A^V N_A^t + b^t$$

where $Q_I^t$, $K_I^t$, and $V_I^t$ denote the query, key, and value vectors of the image modality in time slot $t$, respectively; $Q_A^t$, $K_A^t$, and $V_A^t$ denote those of the voiceprint modality; $W_I^Q$, $W_I^K$, and $W_I^V$ are the three weight matrices of the image modality; and $W_A^Q$, $W_A^K$, and $W_A^V$ are the three weight matrices of the voiceprint modality.
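A minimal numpy sketch of this affine coupling step follows. We read the offset $b^t$ as a scalar projection of the concatenated temporal features (one plausible reading of the formula); all weights and feature values are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16                                                 # feature dimension (illustrative)
N_I, N_A = rng.normal(size=d), rng.normal(size=d)      # waveform features
tau_I, tau_A = rng.normal(size=d), rng.normal(size=d)  # temporal features

# multimodal offset: concatenate temporal features, project with theta
theta = rng.normal(size=2 * d) * 0.1
b_t = np.concatenate([tau_I, tau_A]) @ theta           # scalar offset b^t

def rand_W():
    """Random stand-in for a learned weight matrix."""
    return rng.normal(size=(d, d)) * 0.1

def qkv(n, W_Q, W_K, W_V, b):
    """Affine transformation producing query/key/value with a shared offset."""
    return W_Q @ n + b, W_K @ n + b, W_V @ n + b

Q_I, K_I, V_I = qkv(N_I, rand_W(), rand_W(), rand_W(), b_t)
Q_A, K_A, V_A = qkv(N_A, rand_W(), rand_W(), rand_W(), b_t)
print(Q_I.shape, K_A.shape)  # (16,) (16,)
```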
Next, bidirectional coupled attention is constructed separately for the image and voiceprint modalities, including vertical self-attention and horizontal coupled attention.
Vertical self-attention adopts a multi-head attention mechanism to accommodate more complex mapping relationships. The attention scores between the query vector and the key vector are computed using the dot product function and the softmax function. These scores are used to obtain the weighted sum of the value vectors, resulting in the modality representation after the self-attention calculation, expressed as
$$\mathrm{Attention}\left(Q_I^t, K_I^t, V_I^t\right) = \mathrm{softmax}\left(\frac{Q_I^t \left(K_I^t\right)^{\top}}{\sqrt{d_I}}\right) V_I^t W_o$$

$$\mathrm{Attention}\left(Q_A^t, K_A^t, V_A^t\right) = \mathrm{softmax}\left(\frac{Q_A^t \left(K_A^t\right)^{\top}}{\sqrt{d_A}}\right) V_A^t W_o$$

where $\mathrm{Attention}(Q_I^t, K_I^t, V_I^t)$ is the result of the self-attention operation of the image modality in time slot $t$; $d_I$ and $d_A$ are the output dimensions of the vertical self-attention in the image and voiceprint modalities, respectively; and $W_o$ is the linear transformation of the output.
The above operations are repeated $h$ times, with each iteration considered a head. The results of these $h$ iterations are concatenated and linearly transformed to obtain the multi-head attention output. To prevent the inner product from becoming too large, the computed similarity weights are divided by the square root of the key vector's dimension, and each head has distinct linear transformation parameters. The specific process is expressed as
$$\mathrm{Hhead}_j^I(t) = \mathrm{Attention}\left(Q_I^t W_j^Q, K_I^t W_j^K, V_I^t W_j^V\right)$$

$$\mathrm{Hhead}_j^A(t) = \mathrm{Attention}\left(Q_A^t W_j^Q, K_A^t W_j^K, V_A^t W_j^V\right)$$

$$\mathrm{MHA}\left(Q_I^t, K_I^t, V_I^t\right) = \mathrm{Concat}\left(\mathrm{Hhead}_1^I, \ldots, \mathrm{Hhead}_j^I, \ldots, \mathrm{Hhead}_h^I\right) W_o$$

$$\mathrm{MHA}\left(Q_A^t, K_A^t, V_A^t\right) = \mathrm{Concat}\left(\mathrm{Hhead}_1^A, \ldots, \mathrm{Hhead}_j^A, \ldots, \mathrm{Hhead}_h^A\right) W_o$$

where $\mathrm{Hhead}_j^I(t)$ and $\mathrm{Hhead}_j^A(t)$ are the heads obtained from the image and voiceprint modalities' vertical self-attention operations at the $j$-th step in time slot $t$, respectively; $W_j^Q$, $W_j^K$, and $W_j^V$ are the weight matrices for the query, key, and value vectors at the $j$-th operation; $\mathrm{MHA}(Q_I^t, K_I^t, V_I^t)$ and $\mathrm{MHA}(Q_A^t, K_A^t, V_A^t)$ are the vertical self-attention of the image and voiceprint modalities in time slot $t$; and $\mathrm{Concat}$ denotes the concatenation of features.
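The following numpy sketch implements this multi-head computation under the standard scaled dot-product form; the token count, dimensions, and random initialization are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, heads, W_o):
    """Vertical self-attention: h heads, each with its own projections,
    concatenated and linearly transformed by W_o."""
    outs = [attention(Q @ Wq, K @ Wk, V @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ W_o           # Concat(...) W_o

rng = np.random.default_rng(2)
n, d, h = 6, 16, 4                      # 6 tokens, dim 16, 4 heads (toy sizes)
d_h = d // h
Q = K = V = rng.normal(size=(n, d))     # one modality's Q/K/V from the affine step
heads = [tuple(rng.normal(size=(d, d_h)) * 0.1 for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(d, d)) * 0.1
out = multi_head(Q, K, V, heads, W_o)
print(out.shape)  # (6, 16)
```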

2.3. Cross-Modal Deep Interaction

Horizontal coupled attention introduces information from the other modality when calculating the attention score. Through cross-modal information interaction, the semantic gap between modalities is bridged and the efficiency of defect identification and diagnosis is improved. The horizontal coupled attention of the image modality calculates the interactive attention score between the image and voiceprint modalities. The computation is similar to the vertical self-attention of the image modality, except that the query vector comes from the voiceprint modality. The voiceprint query guides the multi-head attention of the image modality, so that the resulting modal representation contains the interaction information between the modalities, which is expressed as
$$\mathrm{CrossAttention}\left(Q_A^t, K_I^t, V_I^t\right) = \mathrm{softmax}\left(\frac{Q_A^t \left(K_I^t\right)^{\top}}{\sqrt{d_I}}\right) V_I^t W_o$$

$$\mathrm{Hhead}_j^{I,\mathrm{Cross}}(t) = \mathrm{CrossAttention}\left(Q_A^t W_j^Q, K_I^t W_j^K, V_I^t W_j^V\right)$$

$$\mathrm{CrossMHA}\left(Q_A^t, K_I^t, V_I^t\right) = \mathrm{Concat}\left(\mathrm{Hhead}_1^{I,\mathrm{Cross}}, \ldots, \mathrm{Hhead}_j^{I,\mathrm{Cross}}, \ldots, \mathrm{Hhead}_h^{I,\mathrm{Cross}}\right) W_o$$

where $\mathrm{CrossAttention}(Q_A^t, K_I^t, V_I^t)$ denotes the result of the interaction attention operation of the image modality; $\mathrm{Hhead}_j^{I,\mathrm{Cross}}(t)$ represents the head obtained from the image modality's horizontal coupled attention operation at the $j$-th step in time slot $t$; and $\mathrm{CrossMHA}(Q_A^t, K_I^t, V_I^t)$ represents the horizontal coupled attention of the image modality in time slot $t$.
Similarly, the horizontal coupled attention computation process for the voiceprint modality is expressed as
$$\mathrm{CrossAttention}\left(Q_I^t, K_A^t, V_A^t\right) = \mathrm{softmax}\left(\frac{Q_I^t \left(K_A^t\right)^{\top}}{\sqrt{d_A}}\right) V_A^t W_o$$

$$\mathrm{Hhead}_j^{A,\mathrm{Cross}}(t) = \mathrm{CrossAttention}\left(Q_I^t W_j^Q, K_A^t W_j^K, V_A^t W_j^V\right)$$

$$\mathrm{CrossMHA}\left(Q_I^t, K_A^t, V_A^t\right) = \mathrm{Concat}\left(\mathrm{Hhead}_1^{A,\mathrm{Cross}}, \ldots, \mathrm{Hhead}_j^{A,\mathrm{Cross}}, \ldots, \mathrm{Hhead}_h^{A,\mathrm{Cross}}\right) W_o$$

where $\mathrm{CrossAttention}(Q_I^t, K_A^t, V_A^t)$ represents the result of the interaction attention operation for the voiceprint modality in time slot $t$; $\mathrm{Hhead}_j^{A,\mathrm{Cross}}(t)$ represents the head obtained from the voiceprint modality's horizontal coupled attention operation at the $j$-th step in time slot $t$; and $\mathrm{CrossMHA}(Q_I^t, K_A^t, V_A^t)$ represents the horizontal coupled attention of the voiceprint modality in time slot $t$.
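Relative to vertical self-attention, the only change is the source of the query, as this short numpy sketch makes explicit; shapes and values are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q_other, K_self, V_self):
    """Horizontal coupled attention: the query comes from the other
    modality, keys/values from this modality, so the output carries
    cross-modal interaction information."""
    d_k = K_self.shape[-1]
    return softmax(Q_other @ K_self.T / np.sqrt(d_k)) @ V_self

rng = np.random.default_rng(3)
n, d = 6, 16
Q_A = rng.normal(size=(n, d))          # voiceprint query guides...
K_I = rng.normal(size=(n, d))          # ...image keys
V_I = rng.normal(size=(n, d))          # ...and image values
img_repr = cross_attention(Q_A, K_I, V_I)   # image-side coupled attention
# the symmetric direction swaps roles: cross_attention(Q_I, K_A, V_A)
print(img_repr.shape)  # (6, 16)
```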

3. Defect Identification and Diagnosis of Distribution Network Electrical Equipment Based on Two-Stage CNN

To enable defect identification and diagnosis in distribution network electrical equipment through the joint perception of image and voiceprint data, a feature identification method for fused image and voiceprint data based on a temporal feature attention mechanism must be explored. The schematic diagram is shown in Figure 2. Firstly, the bidirectional coupled attention obtained from the joint perception method in the previous section is input into a CNN for preliminary identification of image and voiceprint data features. Secondly, feature identification and fusion of the jointly perceived data are carried out based on a closed-loop CNN, enabling closed-loop updating of the fine diagnostic parameters. This two-stage approach ensures effective backpropagation through the entire network, allowing the model to iteratively refine its parameters based on feedback from both stages. Compared with a single multi-layer CNN, the two-stage CNN architecture offers higher accuracy and better robustness in defect identification, especially in complex environments.

3.1. Preliminary Identification of Image and Voiceprint Data Based on Attention Mechanism with Time Sequence Characteristics

In this section, the CNN network is introduced to perform preliminary identification of data features for image and voiceprint joint perception. CNN, a deep learning model, has achieved significant success in computer vision. Its end-to-end structure enables effective feature extraction.
Relevant features are first extracted from the raw electrical equipment defect data by a convolutional layer. These features are further compressed by the pooling layer, reducing the size of the image and voiceprint feature maps while enhancing learning robustness. Global features are then flattened and fed into fully connected layers for classification or regression. In this way, the CNN significantly reduces the complexity of the network structure through weight self-adjustment. Figure 3 shows the flow chart of the defect identification and diagnosis for distribution network electrical equipment based on the two-stage CNN.
The image and voiceprint joint perception data are input into the convolutional layer. The convolutional layer extracts features from the input data, generating a feature map using a filter. The convolution operation is performed using the Swish activation function, which is expressed as
$$\mathrm{Swish}(x) = \frac{x}{1 + e^{-x}}$$
A new feature map is generated from the features extracted by the convolutional layer. The pooling layer removes redundant features while preserving prominent ones, helping to improve CNN accuracy. The fully connected layer consists of densely interconnected neurons in a feedforward network. Thus, the fully connected layer ultimately receives the features extracted by the convolutional and pooling layers as inputs.
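As an illustration of this convolution-pooling-fully-connected pipeline, the PyTorch sketch below chains convolution, Swish (available as `nn.SiLU`), pooling, and a fully connected head. The layer sizes, class name, and seven-class output are assumptions for demonstration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class StageOneCNN(nn.Module):
    """Minimal sketch of the convolution -> pooling -> fully connected
    pipeline described above (illustrative sizes, not the paper's model)."""
    def __init__(self, in_channels: int = 1, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.SiLU(),                     # SiLU is Swish: x * sigmoid(x)
            nn.MaxPool2d(2),               # pooling compresses the feature map
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Sequential(   # fully connected classification head
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = StageOneCNN()(torch.randn(8, 1, 64, 64))  # batch of 8 fused feature maps
print(logits.shape)  # torch.Size([8, 7])
```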
Next, we perform preliminary identification of image and voiceprint data features based on the CNN. First, we acquire the image data $X_1^t = \{x_{I,1}(t), x_{I,2}(t), \ldots, x_{I,n_1}(t)\}$ and voiceprint data $X_2^t = \{x_{A,1}(t), x_{A,2}(t), \ldots, x_{A,n_2}(t)\}$ based on joint perception, where $n_1$ is the number of image data streams and $n_2$ the number of voiceprint data streams. Using the CNN, we perform preliminary identification of the fused image and voiceprint data features, and the resulting preliminary identification features are expressed as
$$Y_i^f(t) = \Gamma_{\mathrm{CNN}}\left(W_0 A(t)\,\mathrm{Swish}\left(X_i(t)\right) + b\right) = \left\{y_{i,1}(t), y_{i,2}(t), \ldots, y_{i,n_i(t)}(t)\right\}$$

where $i \in \{I, A\}$ is the data modality label; $\{y_{i,1}(t), y_{i,2}(t), \ldots, y_{i,n_i(t)}(t)\}$ is the set of data features identified in time slot $t$; $n_i(t)$ is the data traffic in time slot $t$, i.e., the number of preliminarily identified features; $y_{i,q}(t)$ is the feature identified for the $q$-th data stream in time slot $t$; $\Gamma_{\mathrm{CNN}}$ denotes the output function of the CNN; $\mathrm{Swish}(\cdot)$ is the activation function; $b$ is the bias term of the convolutional neural network; $W_0$ is the initial attention coefficient; and $A(t)$ is the multi-channel coupled temporal interaction attention matrix in time slot $t$, generated by the bidirectional coupled attention layer discussed in Section 2, which can be expressed as
$$A(t) = \begin{bmatrix} \mathrm{MHA}\left(Q_I^t, K_I^t, V_I^t\right) & \mathrm{CrossMHA}\left(Q_I^t, K_A^t, V_A^t\right) \\ \mathrm{CrossMHA}\left(Q_A^t, K_I^t, V_I^t\right) & \mathrm{MHA}\left(Q_A^t, K_A^t, V_A^t\right) \end{bmatrix}$$

For convenience of representation, the head index of Section 2 is replaced by the data stream index in this section. $A(t)$ is divided into the self-attention weight set $A_i^o(t) = \{A_{i,1}^o(t), A_{i,2}^o(t), \ldots, A_{i,q}^o(t), \ldots, A_{i,n_i}^o(t)\}$ and the interaction attention set $A_i^c(t) = \{A_{i,1}^c(t), A_{i,2}^c(t), \ldots, A_{i,q}^c(t), \ldots, A_{i,n_i}^c(t)\}$, where $A_{i,q}^o(t)$ and $A_{i,q}^c(t)$ are the self-attention and interaction attention weights for the $q$-th data stream of modality $i$ in time slot $t$. The subsequent discussion adopts this notation.
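Structurally, assembling $A(t)$ is a 2x2 block composition of the four attention branch outputs, as in this numpy sketch with random stand-ins for the branches.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 6, 16
# outputs of the four attention branches from Section 2 (random stand-ins)
mha_I   = rng.normal(size=(n, d))   # MHA(Q_I, K_I, V_I)
mha_A   = rng.normal(size=(n, d))   # MHA(Q_A, K_A, V_A)
cross_I = rng.normal(size=(n, d))   # CrossMHA(Q_I, K_A, V_A)
cross_A = rng.normal(size=(n, d))   # CrossMHA(Q_A, K_I, V_I)

# 2x2 block layout of the multi-channel coupled temporal attention matrix A(t)
A_t = np.block([[mha_I,   cross_I],
                [cross_A, mha_A]])
print(A_t.shape)  # (12, 32)
```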
Finally, the full feature set of the fused image and voiceprint data is obtained, expressed as

$$Y_{\mathrm{all}}(t) = \left\{Y_i^f(t)\right\}_{i=1,2}$$
To further improve the identification network's attention to typical and frequent defects and enhance the adaptability of preliminary identification, we introduce the temporal interaction attention weight. First, we calculate the pairwise similarity of the fused image and voiceprint data features across the $n_i(t)$ data streams in time slot $t$, which can be expressed as
$$p_{j,k}^i(t) = \sigma_1\left(y_{i,j}(t) \odot y_{i,k}(t)\right) + \sigma_2\left(y_{i,j}(t) \otimes y_{i,k}(t)\right)$$

where $p_{j,k}^i(t)$ represents the similarity between the pre-identified features of the $j$-th and $k$-th ($j \ne k$) channels of modality $i$ in time slot $t$; $y_{i,j}(t)$ and $y_{i,k}(t)$ represent the pre-identified features of the $j$-th and $k$-th channels in time slot $t$, respectively; $\sigma_1$ and $\sigma_2$ are weight coefficients; $\odot$ is the operator that computes the waveform similarity between features; and $\otimes$ is the operator that computes the temporal similarity between features. Specifically, the smaller the time interval between the earlier start time and the later end time of two features, the greater their temporal similarity.
If the similarity between two features is greater than the threshold $p_0$, i.e., $p_{j,k}^i(t) > p_0$, the two features are considered similar. For a set of features, if all features are pairwise similar, they form a feature group. For example, if $R\{p_{1,2}^1(t), p_{1,3}^1(t), p_{2,3}^1(t)\} > p_0$, then the 1st, 2nd, and 3rd features of the image data belong to the same feature group, where $R\{p_{1,2}^1(t), p_{1,3}^1(t), p_{2,3}^1(t)\}$ represents the average value of $p_{1,2}^1(t)$, $p_{1,3}^1(t)$, and $p_{2,3}^1(t)$.
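One possible implementation of this similarity-and-grouping step is sketched below. Cosine similarity and a time-gap decay stand in for the waveform and temporal similarity operators, whose exact forms the paper does not fix, and $\sigma_1$, $\sigma_2$, and $p_0$ are illustrative values.

```python
import numpy as np

def waveform_similarity(a, b):
    """Cosine similarity as an assumed waveform-similarity operator."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def temporal_similarity(span_a, span_b):
    """Smaller gap between the two features' time spans -> higher similarity
    (an assumed form; the text only states the monotonic relationship)."""
    gap = max(span_a[0], span_b[0]) - min(span_a[1], span_b[1])
    return 1.0 / (1.0 + max(gap, 0.0))

def feature_groups(features, spans, sigma1=0.5, sigma2=0.5, p0=0.8):
    """Greedily merge features whose average similarity to a group exceeds p0."""
    groups = []
    for idx, (f, s) in enumerate(zip(features, spans)):
        for g in groups:
            sims = [sigma1 * waveform_similarity(f, features[j])
                    + sigma2 * temporal_similarity(s, spans[j]) for j in g]
            if np.mean(sims) > p0:        # pairwise-similar on average
                g.append(idx)
                break
        else:
            groups.append([idx])          # no match: start a new feature group
    return groups

rng = np.random.default_rng(4)
base = rng.normal(size=8)
feats = [base + 0.01 * rng.normal(size=8) for _ in range(3)] + [rng.normal(size=8)]
spans = [(0.0, 1.0), (0.1, 1.1), (0.2, 1.2), (5.0, 6.0)]
print(feature_groups(feats, spans))  # e.g. [[0, 1, 2], [3]]
```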
Similarly, for several sets of feature groups obtained from image data and voiceprint data, the interaction similarity is solved across modalities to obtain the set of interaction feature groups. Since the features in the interaction feature group are similar, they can be treated as a single feature group. Therefore, for the features corresponding to each interaction feature group, the temporal series interaction attention weight is computed based on the data count in the interaction feature group set, which is expressed as
$$A_{i,h}^c(t) = \left|P_h\right| \lambda_{i,h}(t), \qquad A_i^c(t) = \left\{A_{i,1}^c(t), \ldots, A_{i,h}^c(t), \ldots, A_{i,m_i}^c(t)\right\}$$

where $A_{i,h}^c(t)$ represents the temporal interaction attention weight for feature $y_h(t)$; $\lambda_{i,h}(t)$ represents its weight coefficient; $|P_h|$ denotes the number of elements in the $h$-th interaction feature group $P_h(t)$; and $y_h(t)$ corresponds to the feature in $P_h(t)$. The higher the value of $|P_h|$, the more data within a specific time range have been identified as that feature, meaning the feature is representative and typical, and thus its attention is higher. The minimum value of $|P_h|$ is 1, meaning that in time slot $t$ only one channel identifies this unique feature, forming a feature group containing only that feature; since the feature is rare, its attention is lower.
Furthermore, the temporal interaction attention in time slot t is expressed as
$$A_i(t) = A_i^o(t) + A_i^c(t)$$

where $A_i(t) = \{A_{i,1}(t), A_{i,2}(t), \ldots, A_{i,k}(t), \ldots, A_{i,n_i}(t)\}$, and $A_{i,k}(t)$ represents the temporal interaction attention weight for the $k$-th data stream in time slot $t$, with $k$ the data stream index.

3.2. Precise Diagnosis Method of Image and Voiceprint Defects Based on Closed-Loop CNN

Based on the acquired temporal interactive attention, the attention weights in the preliminary identification model can be corrected, and the temporal features and cross-modal information of the fused image and voiceprint data can be fully mined. This enables the precise diagnosis of image and voiceprint data features, which can be expressed as
$$Y(t) = \Gamma_{\mathrm{CNN}}\left(\left(W_1 + W_2\right) \mathrm{Swish}\left(Y_{\mathrm{all}}(t)\right) + b\right)$$

where $W_i = \{W_{i,1}, W_{i,2}, \ldots, W_{i,n_i}\}$, $i = 1, 2$, with $W_{i,k} = W_{i,k,0} A_{i,k}(t)$; $W_{i,k}$ represents the corrected attention coefficient for the $k$-th data stream and $W_{i,k,0}$ the initial attention coefficient. $Y(t) = \{y_1(t), \ldots, y_q(t), \ldots, y_n(t)\}$ represents the fine identification result of the fused image and voiceprint data features in time slot $t$; $y_q(t)$ is the $q$-th fine diagnostic result in time slot $t$, with $q$ the fine diagnostic feature index and $n$ the number of fine-diagnosed features.
In addition, this section proposes a closed-loop feedback method for updating the feature diagnosis model of image and voiceprint data. According to the feature identification results and actual defect conditions, it adaptively updates the preliminary identification of image and voiceprint features, the temporal interactive attention of the fused data, and the fine diagnosis of defect features. This realizes dynamic optimization of the diagnostic model and enhances the comprehensiveness, robustness, and reliability of defect diagnosis.
Specifically, the full data feature set $Y_{\mathrm{all}}(t)$ and the fine diagnostic results $Y(t)$ from the backend are used. If a feature diagnosed by the backend is not identified by the frontend, it is added to the original full data feature set, i.e., $Y_{\mathrm{all}}(t) = \{Y_{\mathrm{all}}(t), y_k(t)\}$, resulting in a new full data feature set. This new set is then input into the temporal interaction attention for attention identification, where $y_k(t)$ denotes a feature diagnosed by the backend but not identified by the frontend.
If the self-attention weight of $y_k(t)$ is greater than the average self-attention weight of the full data feature set, the attention of feature $y_k(t)$ takes precedence, which is expressed as

$$A_{i,k}^o(t) > E\left[A_i^o(t)\right]$$

where $A_{i,k}^o(t)$ represents the self-attention weight of feature $y_k(t)$, and $E[A_i^o(t)]$ represents the average self-attention weight of the full data feature set.
Based on this, the preliminary identification of the fused image and voiceprint data features is updated: the self-attention weight of feature $y_k(t)$ in $A(t)$ is updated to $A_{i,k}^o(t)$.
If the attention of feature $y_k(t)$ does not take precedence, but historical defect diagnosis results show that the appearance of $y_k(t)$ is indeed followed by defect E (which has not been detected by the data features), then $y_k(t)$ can be considered a potential indicator of defect E. In this case, the temporal interaction attention parameters for the fused image and voiceprint data are adjusted to increase the model's attention to $y_k(t)$, thereby improving the accuracy of early warning for the potential defect E. Specifically, the correlation $z_k^E(t)$ between feature $y_k(t)$ and the occurrence of defect E is calculated, expressed as
$$z_k^E(t) = \mathrm{num}\left(y_k(t) \in D_E(t)\right)$$

where $\mathrm{num}(\cdot)$ is a counting function and $D_E(t)$ represents the historical feature identification results of defect E. The higher the frequency of feature $y_k(t)$ in the historical identification results of defect E, the greater the correlation between $y_k(t)$ and the occurrence of defect E. Further, the temporal interaction attention parameters for the fused image and voiceprint data are adjusted, expressed as
$$\lambda_k(t+1) = \begin{cases} \lambda_k(t) + \beta \exp\left(z_k(t) - z_0\right), & z_k(t) > z_0 \\ \lambda_k(t), & z_k(t) \le z_0 \end{cases}$$

where $\lambda_k(t)$ and $\lambda_k(t+1)$ represent the temporal interaction attention weight coefficients for feature $y_k(t)$ in time slots $t$ and $t+1$, respectively; $z_0$ is the correlation threshold and $\beta$ is the update step size. When the correlation $z_k(t)$ between feature $y_k(t)$ and defect E exceeds the threshold $z_0$, the coefficient is increased in time slot $t+1$, thereby raising the attention paid to feature $y_k(t)$.
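The update rule reduces to a few lines of code; the threshold $z_0$ and step size $\beta$ below are illustrative values.

```python
import numpy as np

def update_attention_coeff(lam_t, z_t, z0=0.5, beta=0.01):
    """Closed-loop update of the temporal interaction attention coefficient:
    boost attention to a potential-defect feature once its correlation with
    defect E exceeds the threshold z0 (parameter values are illustrative)."""
    if z_t > z0:
        return lam_t + beta * np.exp(z_t - z0)
    return lam_t

lam = 1.0
for z in [0.2, 0.4, 0.7, 0.9]:   # rising correlation with defect E
    lam = update_attention_coeff(lam, z)
    print(f"z={z:.1f} -> lambda={lam:.4f}")
```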
If a feature with dominant attention identified by the frontend is repeatedly screened out by the backend fused-data feature diagnosis, the CNN network in the fused-data feature diagnosis stage is updated in a closed loop, i.e.,

$$\pi_{\mathrm{CNN}}^{k+1} = \pi_{\mathrm{CNN}}^{k} - \nabla \log \pi_{\mathrm{CNN}} \cdot \frac{1}{N} \sum_{t=1}^{N} \left\| Y_{\mathrm{all}}(t) - Y(t) \right\|^2$$

where $\pi_{\mathrm{CNN}}$ represents the CNN network for the fine diagnosis of the image and voiceprint data features, and $\frac{1}{N}\sum_{t=1}^{N}\left\|Y_{\mathrm{all}}(t) - Y(t)\right\|^2$ represents the identification error caused by discarding feature $y_k(t)$ during the fine diagnosis of the fused image and voiceprint data features.

4. Simulation Verification

In order to verify the accuracy of the proposed method, a dataset of 10,000 samples spanning 5 years of historical data is constructed, containing the images and voiceprint signals of equipment at fault locations in the distribution network. From this dataset, seven typical fault types of electrical equipment in the distribution network are selected: partial discharge of medium-voltage switchgear, mechanical fault, loose transformer core, DC bias, heavy overload, transmission line damage, and bird's nest. The first 7500 samples are used for training and the remaining 2500 for testing. During training, the maximum number of iterations is set to 200, with a learning rate of 0.1. The training data and labels are used to train the image and voiceprint feature joint perception network based on bidirectional coupled attention, and the test set fault data are fed into the trained network for detection.
In addition, we select two algorithms for comparative analysis to verify the effectiveness of the proposed algorithm. Baseline 1 adopts the LSTM algorithm [19], which extracts and processes the image and voiceprint signals separately, without concatenating and coupling the features of the two modalities. The LSTM is mainly used to capture time sequence features and identify potential defects by analyzing the signal at each time step; however, when the environmental background noise is strong or the image features are blurred, it is less robust in defect identification. Baseline 2 uses the traditional CNN algorithm [20], which uses classical convolutional layers for feature learning and classifies through pooling and fully connected layers. However, it does not perform closed-loop feedback adjustment based on the actual defect situation, which limits the accuracy and robustness of defect identification. The traditional CNN also has shortcomings in fusing multimodal information and struggles to exploit the complementarity between image and voiceprint features, which limits its identification performance under complex working conditions. The baseline algorithms use the same training and test sets as the proposed algorithm. We verify the performance of the proposed algorithm by comparing its defect identification accuracy and convergence behavior with the two baselines.
In order to preliminarily test the effectiveness of the electrical equipment defect identification model in this paper, after model training on the dataset, the confusion matrix is used to measure the identification accuracy of the model for each state sample, as shown in Figure 4. The identification accuracy of defects is higher than 97%. This indicates that the image and voiceprint feature joint perception model based on bidirectional coupled attention proposed in this paper has good sensitivity and identification ability for electrical equipment defects.
To evaluate the accuracy of the algorithm, five metrics are selected: accuracy, precision, recall, average precision (mAP), and F1 score.

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{Precision} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}}$$

$$\mathrm{Recall} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FN}}}$$

$$mAP = \frac{1}{M} \sum_{m=1}^{M} AP_m$$

where $N_{\mathrm{TP}}$, $N_{\mathrm{FP}}$, and $N_{\mathrm{FN}}$ represent the numbers of true detections, false detections, and missed detections, respectively; $AP_m$ is the area under the precision-recall (P-R) curve for category $m$; and $M$ is the number of detection categories.
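These metrics follow directly from the detection counts and the per-class AP values, as in this short sketch with made-up numbers.

```python
import numpy as np

def precision_recall_f1(n_tp: int, n_fp: int, n_fn: int):
    """Precision, recall, and F1 from true/false detections and misses."""
    precision = n_tp / (n_tp + n_fp)
    recall = n_tp / (n_tp + n_fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class areas under the P-R curve."""
    return float(np.mean(ap_per_class))

# illustrative counts for one defect class
p, r, f1 = precision_recall_f1(n_tp=90, n_fp=10, n_fn=12)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
print(f"mAP={mean_average_precision([0.93, 0.88, 0.95]):.3f}")
```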
Figure 5 shows the comparison of the defect identification results. Compared with the two baseline algorithms, the accuracy of the proposed algorithm is improved by 12.1% and 33.7%, the precision by 11.3% and 34.3%, the recall by 12.5% and 33.1%, the F1 score by 10.4% and 34.1%, and the average precision by 16.2% and 34.3%, respectively. The proposed algorithm analyzes the temporal characteristics of the image and voiceprint data and adaptively adjusts the attention weights of the network according to the time intervals and correlations. This strengthens the attention paid to typical and frequent defects, improving the accuracy of preliminary identification. Additionally, the closed-loop CNN realizes multi-stage iterative optimization by feeding the results of preliminary identification back to the fine diagnosis network. This allows the diagnosis network to carry out detailed and accurate diagnosis of actual defects in complex environments, improving robustness and accuracy.
Figure 6 shows the change of mAP versus slots. The mAP of the proposed algorithm rises to about 0.85 in the first 30 slots and finally stabilizes at 0.931. Baseline 2 shows a slow increase in mAP to 0.6 around 50 training slots, eventually stabilizing at 0.611. Compared with the two baseline algorithms, the convergence speed of the proposed algorithm is increased by 22.92% and 60.42%, respectively. The proposed algorithm uses the vertical self-attention mechanism to explore the intricate relationships within each modality through multi-head attention, so that the network can adaptively attend to the most critical features and improve its perception of complex defects. Horizontal coupled attention introduces cross-modal information interaction, so that the semantic gap between the image and the voiceprint is effectively bridged. This significantly enhances collaboration between the modalities while improving the efficiency and accuracy of defect identification.
Figure 7 shows the spatial distribution of features after image and voiceprint joint perception and fault feature extraction. The three coordinate axes represent the principal component of the image features, the principal component of the voiceprint features, and the multi-modal fusion feature, respectively. Each point corresponds to a fault sample mapped into the three-dimensional space through principal component analysis. The close aggregation of feature points means that the discriminability of the fault types in the high-dimensional feature space is enhanced, indicating that the method successfully separates the features of different fault types and achieves high effectiveness and accuracy in feature extraction and fault identification. Notably, defect 4 and defect 7 lie away from the center, which further highlights their distinct characteristics and effective separation from other defects, a desirable outcome for accurate fault identification and diagnosis. The vertical self-attention mechanism mines complex feature associations within the image and voiceprint modalities, while the horizontal coupled attention mechanism eliminates the semantic gap between modalities through cross-modal information exchange, allowing the image and voiceprint features to complement each other and improving identification accuracy. Through this bidirectional coupled attention mechanism, the model captures the unique characteristics of each modality and enhances the synergy between them, so that fault samples of the same type cluster tightly in the feature space. Compared with traditional single-modal feature extraction methods, the accuracy and robustness of identification are significantly improved; in particular, under complex backgrounds and multi-source interference, the model maintains strong identification ability through the complementarity of multi-modal features.
Considering potential abnormal states in the network, two types of distortion are added to the collected images and voiceprint signals of the equipment defects to verify the robustness of the identification and diagnosis model: Gaussian white noise and a reduction in image quality. At the beginning of each experiment, 90% of the distorted samples are randomly assigned to the training dataset and the remaining distorted images and voiceprint signals to the test dataset.
Table 1 shows that the proposed algorithm maintains accurate fault target detection even with added noise. Although its detection accuracy slightly declines, the average accuracy and recall rate remain high, demonstrating strong anti-noise capability. However, as signal distortion increases, the model’s identification accuracy decreases. This is because the proposed algorithm integrates image and voiceprint features using a bidirectional coupled attention mechanism, along with multi-modal information fusion, to realize multi-modal joint perception. Integrating and complementing information from different feature sources allows key features to be preserved in distorted signals while reducing noise interference. By incorporating multi-dimensional voiceprint information, the model’s noise resistance and robustness are enhanced, which is why the decline in performance is not significant. In addition, the adaptive weight adjustment of the temporal adaptive CNN and the feedback mechanism of the closed-loop CNN further improve the model’s adaptability to abnormal inputs, so that the model can identify the effective features of interest through optimization during the training process. This ensures that detection accuracy is maintained even under distorted signals.

5. Conclusions

This paper addressed the efficiency and accuracy of defect identification and diagnosis of electrical equipment in distribution networks by proposing a defect identification and diagnosis method based on fused image and voiceprint joint perception. Through initial feature extraction and concatenation coupling, a bidirectional coupled attention mechanism was constructed to enhance the efficiency of defect identification and diagnosis. Meanwhile, by combining temporal adaptability with a closed-loop feedback mechanism, precise diagnosis of equipment defects was achieved, significantly improving diagnostic accuracy and robustness. Baseline 1, an LSTM-based algorithm that extracts and processes image and voiceprint features separately, and Baseline 2, a traditional CNN algorithm that uses classical convolutional layers for feature learning and classification, were chosen as comparison benchmarks. Compared with the baselines, the proposed method enhances defect identification accuracy by 12.1% and 33.7%, increases recall by 12.5% and 33.1%, and boosts diagnostic efficiency by 22.92% and 60.42%, respectively. Future research will focus on integrating more modal signals, particularly combining voltage, current, and electromagnetic signals with image and voiceprint features, to provide multi-dimensional data support for fault identification and diagnosis. Further exploration will also address how to leverage deep learning and other advanced machine learning methods to achieve deeper multimodal signal fusion, supporting more accurate fault identification and diagnosis.

Author Contributions

Conceptualization, A.C., J.L., J.F., and B.L.; methodology, A.C.; software, J.L.; validation, J.L. and S.L.; formal analysis, S.L.; investigation, S.L.; resources, A.C.; data curation, J.L. and S.L.; writing—original draft preparation, A.C., J.L., and J.F.; writing—review and editing, A.C., S.L., and B.L.; project administration, A.C.; funding acquisition, A.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Southern Power Grid Corporation Science and Technology Project, grant number GDKJXM20240115.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

Authors An Chen, Junle Liu and Silin Liu were employed by the China Southern Power Grid Guangdong Zhongshan Power Supply Bureau. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Jiang, L.; Li, W.; Fu, B.; Bai, L. A multi-layer fault component identification method based on multi-source information fusion in distribution power grid. In Proceedings of the 2022 7th Asia Conference on Power and Electrical Engineering (ACPEE), Hangzhou, China, 15–17 April 2022.
  2. Sun, Y.; Zhou, Q.; Qin, X.; Zhang, Y.; Zhang, J.; Zhu, H. Identification index and method of serious faults and weak links in power system transient power angle stability. In Proceedings of the 2019 4th IEEE Workshop on the Electronic Grid (eGRID), Xiamen, China, 11–14 November 2019.
  3. Zhou, N.; Xu, Y. A prioritization method for switchgear maintenance based on equipment failure mode analysis and integrated risk assessment. IEEE Trans. Power Deliv. 2024, 39, 728–739.
  4. Hatziargyriou, N.D.; Milanovic, J.V.; Rahmann, C.; Ajjarapu, V.; Vournas, C. Definition and classification of power system stability-revisited & extended. IEEE Trans. Power Syst. 2020, 36, 3271–3281.
  5. Huang, X.; Zhang, X.; Zhang, Y.; Zhao, L. A method of identifying rust status of dampers based on image processing. IEEE Trans. Instrum. Meas. 2020, 69, 5407–5417.
  6. Yu, Z.; Wei, Y.; Niu, B.; Zhang, X. Automatic condition monitoring and fault diagnosis system for power transformers based on voiceprint identification. IEEE Trans. Instrum. Meas. 2024, 73, 1–11.
  7. Lin, Y.; Wan, W.; Shang, B.; Li, X. Fault diagnosis method of power equipment based on infrared thermal images. In Proceedings of the 2023 IEEE International Conference on Power Science and Technology (ICPST), Kunming, China, 5–7 May 2023.
  8. Kong, L.; Wang, Y.; Chen, J.; Chen, Y.; Chen, M.; Liu, J. Research on the fuzzy enhancement algorithm for infrared images of power equipment based on membership function. In Proceedings of the 2024 4th Power System and Green Energy Conference (PSGEC), Shanghai, China, 22–24 August 2024.
  9. Yin, J.; Li, Z.; Cui, L.; Zhang, W.; Wang, Q.; Si, G. CycleGAN-based visible-infrared image enhancement method for infrared power equipment object detection. In Proceedings of the 2023 IEEE 5th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 14–16 July 2023.
  10. Qi, X.; Shi, L.; Li, X.; Hao, C.; Ji, S.; Chai, F.; Han, D. Transformer voiceprint feature extraction and fault recognition based on MFCC and deep learning. In Proceedings of the 2023 IEEE 5th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 14–16 July 2023.
  11. Satea, M.; Elsadd, M.; Zaky, M.; Elgamasy, M. Reliable high impedance fault detection with experimental investigation in distribution systems. Eng. Technol. Appl. Sci. Res. 2024, 14, 17248–17255.
  12. Meng, S.; Peng, W.; Tian, C. Unbalanced data-driven abnormal power usage detection method based on gated cyclic unit. Comput. Meas. Control 2023, 31, 54–60.
  13. Madhloom, J.K.; Abd Ghani, M.K.; Baharon, M.R. Enhancement to the patient's health care image encryption system, using several layers of DNA computing and AES (MLAESDNA). Period. Eng. Nat. Sci. (PEN) 2021, 9, 928–947.
  14. Hassen, O.A.; Majeed, H.L.; Hussein, M.A.; Darwish, S.M.; Al-Boridi, O. Quantum machine learning for video compression: An optimal video frames compression model using qutrits quantum genetic algorithm for video multicast over the internet. J. Cybersecur. Inf. Manag. (JCIM) 2025, 15, 43–64.
  15. Qiu, J.; Liang, Y.; Cheng, X.; Zhao, X.; Ma, L. State assessment of key equipment of microgrid based on multi-source data fusion method. In Proceedings of the 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Chongqing, China, 17–19 June 2022.
  16. Xing, Z.; He, Y. Multimodal mutual neural network for health assessment of power transformer. IEEE Syst. J. 2023, 17, 2664–2673.
  17. Ziuzev, A.; Nakataev, A.; Shelyug, S.; Ippolitov, V. Influence of an electric drive with periodic load on voltage quality. In Proceedings of the 2021 28th International Workshop on Electric Drives: Improving Reliability of Electric Drives (IWED), Moscow, Russia, 27–29 January 2021.
  18. Chen, Z.; Wang, Q.; He, Q.; Yu, T.; Zhang, M.; Wang, P. CUFuse: Camera and ultrasound data fusion for rail defect detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21971–21983.
  19. Liang, Y.; Zhang, J.; Shi, Z.; Zhao, H.; Wang, Y.; Xing, Y.; Zhang, X.; Wang, Y.; Zhu, H. A fault identification method of hybrid HVDC system based on wavelet packet energy spectrum and CNN. Electronics 2024, 13, 2788.
  20. Moradzadeh, A.; Teimourzadeh, H.; Mohammadi-Ivatloo, B.; Pourhossein, K. Hybrid CNN-LSTM approaches for identification of type and locations of transmission line faults. Int. J. Electr. Power Energy Syst. 2022, 39, 107563.
Figure 1. Principle of joint perception of image and voiceprint features based on bidirectional coupled attention.
Figure 2. Schematic diagram of the defect identification and diagnosis method of distribution network electrical equipment based on two-stage CNN.
Figure 3. Flow chart of the two-stage CNN defect preliminary identification and fine diagnosis method.
Figure 4. Confusion matrix for fault sample defect identification.
Figure 5. Comparison of detection results of different algorithms.
Figure 6. The change of mAP versus slots.
Figure 7. Spatial distribution of features after image and voiceprint joint perception and fault feature extraction.
Table 1. Comparison of detection results of each algorithm after adding distortion.

Method               Accuracy   Precision   Recall   Average Precision   F1 Score
Proposed Algorithm   0.871      0.833       0.901    0.916               0.897
Baseline 1           0.733      0.781       0.693    0.763               0.699
Baseline 2           0.493      0.477       0.510    0.442               0.491