1. Introduction
As the core equipment of pumped-storage power plants, hydroelectric generator sets operate under high loads for extended periods, making them prone to faults such as bearing wear and blade cavitation. These faults can reduce operational efficiency, shorten equipment lifespan, and, in severe cases, cause shutdowns, leading to significant economic losses and safety hazards [1,2]. Timely and accurate diagnosis of the fault status of hydroelectric generator sets is therefore of considerable practical importance. The acoustic signals generated during operation carry valuable information about the internal mechanical condition of the equipment: when the operating status changes, the acoustic signals change accordingly. Acoustic diagnosis technology [3,4,5] identifies faults by analyzing these signals, enabling early warning and diagnosis without disassembly, shutdown inspections, or disruption of normal unit operation.
Extensive research has been conducted on acoustic signature fault diagnosis for hydroelectric units [6,7,8]. Reference [9] uses acoustic signature information collected by non-contact microphone sensors to predict the remaining useful life of rotating machinery, and proposes an information-theoretic method for selecting a subset of modulated spectral features. Reference [10] proposes an improved Mel-frequency cepstral coefficient feature extraction method for fault diagnosis of gas-insulated switchgear (GIS) in power plants, and verifies that the method improves the efficiency and reliability of online GIS diagnosis. Reference [11] exploits the significant changes in machine vibration frequency, vibration amplitude, and time-domain waveforms during faults, extracting features from vibration signals to verify the corresponding fault types. Reference [12] characterizes the fault status of mechanical equipment by monitoring and analyzing the electrical signals generated by the target object. However, the fault feature information in signals collected during the early stages of a fault is very weak, so fault feature extraction with the electrical signal diagnosis method is generally ineffective. Reference [13] jointly analyzes vibration signals, electrical signals, and other monitoring parameters to achieve a more comprehensive diagnosis of the operating status of hydroelectric generator bearings, but the cost of data acquisition is relatively high. Overall, the aforementioned studies rely on single feature thresholds or shallow machine learning models, which makes it difficult to capture the complex relationship between the non-stationary characteristics of acoustic signals and faults and results in low fault diagnosis accuracy.
With the continuous development of deep learning, the diagnostic accuracy for hydroelectric generator sets has also improved [14,15]. Reference [16] proposed a method based on two-dimensional set local mean decomposition and optimized dynamic least squares support vector machines to diagnose faults in anti-friction bearings rapidly and efficiently. Reference [17] proposed a method for predicting bearing anomalies using LSTM-KLD, which uses divergence methods to uncover the distribution patterns of anomalous data and effectively extract data features. Reference [18] uses deep neural networks to extract sound signal features, achieving frame-level hierarchical classification with good results. Reference [19] employs a particle swarm optimization algorithm to enhance the extreme learning machine, combined with variational mode decomposition, to extract weak features from early-stage bearing fault monitoring data. Reference [20] employs an improved autoregressive integrated moving average (ARIMA) model to capture the temporal correlation and probability distribution of wind speed time series data, thereby constructing a time-series-based prediction model. Reference [21] improves the maintenance level of hydroelectric turbine units by extracting more sensitive IMF components from acoustic signals and training a convolutional neural network (CNN) on them. Although these deep learning-based fault diagnosis methods can effectively identify common faults in hydroelectric generator units when supported by a large number of fault samples and different acoustic features, several key issues remain to be addressed in practical applications: (1) The complex network structure significantly increases the computational load, degrading real-time system performance; (2) Traditional networks are constrained by vanishing or exploding gradients, which makes it difficult to extract effective features from high-dimensional data, and they are also prone to overfitting; (3) Existing defect diagnosis methods mostly consider only single defects in hydroelectric generator sets and do not address acoustic pattern recognition when multiple defects occur simultaneously.
To address these issues, namely that existing defect identification considers only single defects, that model complexity is high, and that models are prone to overfitting, this paper constructs an acoustic-signature-based fault diagnosis model for hydroelectric generator sets using ResNet + self-attention. The main contributions of this paper are as follows:
- (1) To address the poor feature extraction capability of traditional diagnostic models, the network structure of the traditional residual network is optimized to enable deeper training for acoustic signature feature extraction, faster model convergence, and avoidance of vanishing and exploding gradients.
- (2) To address the varying contributions of different regions in acoustic signature signals to fault diagnosis, the self-attention mechanism is employed to enable the model to automatically focus on key features related to faults.
- (3) To address the issue of low diagnostic accuracy, we use Bayesian optimization algorithms to process high-dimensional hyperparameter spaces, accelerating model convergence and improving diagnostic performance.
- (4) We integrate the model into an edge computing platform and embed it into a hydroelectric generator acoustic monitoring system, validating the algorithm’s generalization capabilities in real-world noisy environments and providing a practical solution for intelligent maintenance of power equipment.
This paper first introduces the overall system structure and details the entire process of fault identification for hydroelectric generator sets based on acoustic signature signals. It then explains the acoustic signature noise reduction method for hydroelectric generator sets, first introducing the EEMD principle, then using correlation coefficient analysis to screen for effective signals, and finally providing the noise reduction process. It then focuses on analyzing the fault identification network, covering the self-attention mechanism, ResNet50 principle, fusion network architecture, and Bayesian optimization algorithm. Finally, through experiments in real-world environments, the algorithm’s effectiveness is validated from multiple aspects, including deployment, noise reduction, and diagnostic results.
4. Fault Identification Network
The fault identification network uses the denoised IMF signals as its dataset and trains the ResNet + self-attention network to classify faults. Bayesian optimization is introduced to tune the model’s hyperparameters. The workflow is shown in Figure 3.
4.1. Self-Attention
In the task of acoustic fingerprint fault diagnosis for hydroelectric generator sets, acoustic fingerprint signals contain rich information, but the contribution of information from different regions to fault identification varies. The self-attention mechanism architecture is shown in Figure 4.
For the input signal, a linear transformation is first applied to obtain the query matrix $Q$, key matrix $K$, and value matrix $V$. Then, the dot product of $Q$ and $K$ is calculated and normalized using the SoftMax function to obtain the attention weight matrix, which describes the relative importance of positions in different regions of the feature map. Finally, the attention weight matrix is multiplied by the value matrix $V$ to obtain the weighted feature representation:
$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(QK^{\mathsf{T}}\right)V$$
By introducing the self-attention mechanism, the model can automatically learn the importance of different regions in the voiceprint signal, enabling it to more accurately capture key features related to faults and thereby improve the accuracy of fault diagnosis.
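The computation described above can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: the helper names (`matmul`, `softmax_rows`, `self_attention`), the toy input, and the identity projection matrices are assumptions made for the example, and the optional `scaled` flag adds the common 1/sqrt(d) scaling, which the text does not mention.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v, scaled=False):
    """Return (weighted features, attention weights) for the input rows x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)                      # dot products between positions
    if scaled:                                   # assumption: not stated in the text
        d = math.sqrt(len(q[0]))
        scores = [[s / d for s in row] for row in scores]
    weights = softmax_rows(scores)               # attention weight matrix
    return matmul(weights, v), weights

# Toy input: 3 positions (regions) with 2 features each; identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(x, eye, eye, eye)
```

Each row of `attn` sums to one, so every output row is a convex combination of the value rows; in this toy case the third position attends most strongly to itself because its dot product with itself is the largest in its row.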
4.2. ResNet50
ResNet50 is a 50-layer residual network designed to address the vanishing and exploding gradients that often arise when training deep neural networks, which allows deeper models to be built and richer feature representations to be learned. It automatically learns feature representations at different levels from acoustic feature maps: lower-level residual blocks learn local features, while higher-level blocks learn abstract global features, providing strong support for fault classification. Residual blocks with skip connections resolve the gradient issues in deep network training, so the network can be trained deeper and learn complex features. This suits the complexity and non-stationarity of acoustic signals and supports good fault diagnosis performance under various operating environments and conditions.
4.3. ResNet + Self-Attention
Based on the analysis above, residual networks and attention mechanisms each have their own advantages, and applying the attention mechanism globally can further enhance model performance. Accordingly, the three residual blocks at the end of the ResNet-50 model (one CRB and two IRBs) are replaced with three global self-attention blocks (GSABs). For shallow convolutions, applying self-attention to the extracted local features would incur high computational costs with low gains. In contrast, applying GSABs to the deep-layer, low-resolution, high-channel feature maps can effectively capture long-range dependencies. Moreover, the parameter count of these three layers accounts for only 10%, enabling an improvement in accuracy while controlling computational costs. In this module, the self-attention layer computes the global contextual features of the feature map received from the preceding layer and performs weighted fusion of local features, thereby obtaining feature representations with higher discriminability and expressive power. Finally, the resulting outputs are fed into the subsequent pooling layer and fully connected layer to integrate the features and make the decision, and the final classification results are obtained through the Softmax function. The network architecture is illustrated in Figure 5.
The specific operation process of the network is described as follows:
Let the input sample be $x$. The convolution of the convolution kernel $W$ and $x$ is calculated to obtain the feature vector $y$:
$$y = W * x + b$$
where $W$ and $b$ are the parameters to be trained.
Batch normalization (BN) is applied to the convolution output $y$ to improve the training speed of the model. Let $y$ be an $m \times l$ matrix, and $y_{ij}$ be the element in the $i$-th row and $j$-th column. The normalized result $\hat{y}$ is expressed as:
$$\hat{y}_{ij} = \gamma\,\frac{y_{ij} - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta$$
where $\mu$ and $\sigma^{2}$ are the mean and variance estimated over the mini-batch, $\epsilon$ is a small constant that prevents division by zero when the variance is zero, $\hat{y}_{ij}$ is the element in the $i$-th row and $j$-th column of $\hat{y}$, and $\gamma$ and $\beta$ are the parameters to be trained. The standardization step adjusts the distribution of $y$ to zero mean and unit variance, so that the input data fall within the region where the activation function is most sensitive, thereby mitigating the vanishing gradient problem. However, standardization alone would reduce the network’s expressive power. Therefore, the standardized values are rescaled and shifted with the learnable parameters $\gamma$ and $\beta$, which allows the model to undo the standardization where that gives a better optimization result.
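As a hedged illustration of the normalization step, the sketch below standardizes a small matrix and then applies the trainable scale γ and shift β. It is a simplification under stated assumptions: the function name `batch_norm` is hypothetical, and the mean and variance are taken over all elements of a single matrix, whereas a real BN layer estimates them per channel over a mini-batch.

```python
import math

def batch_norm(y, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize all elements of matrix y, then apply scale gamma and shift beta.

    Assumption: statistics are computed over the whole matrix for simplicity.
    """
    values = [v for row in y for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    denom = math.sqrt(var + eps)          # eps keeps the denominator non-zero
    return [[gamma * (v - mean) / denom + beta for v in row] for row in y]

y = [[1.0, 2.0], [3.0, 4.0]]
standardized = batch_norm(y)                      # zero mean, roughly unit variance
rescaled = batch_norm(y, gamma=2.0, beta=1.0)     # learnable scale and shift applied
```

With `gamma=1` and `beta=0` the output is simply standardized; other values let the layer move away from the standard distribution when that helps training.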
The normalized result is then divided into several non-overlapping segments and the maximum element of each segment is returned, i.e., max pooling is performed. This reduces the feature dimension and the number of parameters to be trained. The output of the max pooling layer is denoted as $p$, a $q \times l$ matrix, and $p_{kj}$, the element in the $k$-th row and $j$-th column of $p$, is given by:
$$p_{kj} = \max_{(k-1)s < i \le ks} \hat{y}_{ij}$$
where $\hat{y}_{ij}$ is an element of the normalized result and $s$ is the length of the non-overlapping segment.
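The pooling step can be sketched as follows, assuming the segments run along the rows of the matrix so that an m × l input becomes a q × l output with q = m / s. The function name `max_pool_rows` and the toy matrix are hypothetical.

```python
def max_pool_rows(y, s):
    """Max-pool an m x l matrix over non-overlapping row segments of length s."""
    if len(y) % s != 0:
        raise ValueError("number of rows must be a multiple of s")
    pooled = []
    for start in range(0, len(y), s):
        segment = y[start:start + s]
        # Column-wise maximum inside the current segment.
        pooled.append([max(col) for col in zip(*segment)])
    return pooled

y_hat = [[1, 5], [3, 2], [0, 7], [4, 4]]
p = max_pool_rows(y_hat, s=2)   # 4 x 2 input, 2 x 2 output
```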
Thereafter, the abstract representation of the pooled output $p$ is extracted through multi-level residual units. Each residual unit calculates the sum of the residual function and its input features, where the residual function $F$ is a nonlinear mapping composed of three convolutional layers:
$$F_t = F(r_{t-1}; \theta_t)$$
where $r_{t-1}$ is the input to the $t$-th residual unit and $\theta_t$ is the set of trainable parameters for the $t$-th residual unit. In this paper, $t = 1, 2, \ldots, 16$. When $t = 1$, $r_0 = p$.
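In IRB, the shortcut connection adds two features of equal dimensions element-wise, i.e., it is an identity mapping. The output $r_t$ of the $t$-th residual unit can be expressed as:
$$r_t = F(r_{t-1}; \theta_t) + r_{t-1}$$
where $r_t$ is a multidimensional tensor, $r_{t-1}$ is the input to the unit, and $F(\cdot\,; \theta_t)$ is the residual function with trainable parameters $\theta_t$. The shortcut in IRB introduces no additional parameters and negligible computational overhead, which is a significant advantage in practical applications.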
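In CRB, since the output channels of the convolutional layers are modified, the feature dimensions are unequal during addition, so the dimensions must be matched in the shortcut connection (achieved via a 1 × 1 convolution). This increases the number of network parameters but also enhances performance. The process can be represented as:
$$r_t = F(r_{t-1}; \theta_t) + W_s * r_{t-1} + b_s$$
where $r_{t-1}$ is the input to the unit, $F(\cdot\,; \theta_t)$ is the residual function, and $W_s$ and $b_s$ are the parameters of the 1 × 1 convolution to be trained.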
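The difference between the two shortcut types can be illustrated with a toy sketch that treats the features at a single position as a channel vector, so that a 1 × 1 convolution reduces to a linear map over channels. The function names, the stand-in residual functions, and the projection weights below are all hypothetical and serve only to show the two addition rules.

```python
def identity_block(r_prev, residual_fn):
    """IRB: add the residual branch to the unchanged input (equal dimensions)."""
    f = residual_fn(r_prev)
    if len(f) != len(r_prev):
        raise ValueError("identity shortcut needs equal dimensions")
    return [a + b for a, b in zip(f, r_prev)]

def conv_block(r_prev, residual_fn, w_s, b_s):
    """CRB: project the input with a 1 x 1 convolution (here a linear map over
    channels) so that it matches the residual branch, then add."""
    f = residual_fn(r_prev)
    shortcut = [sum(w * x for w, x in zip(row, r_prev)) + b
                for row, b in zip(w_s, b_s)]
    if len(f) != len(shortcut):
        raise ValueError("projection must match the residual branch")
    return [a + b for a, b in zip(f, shortcut)]

r0 = [1.0, 2.0]

# Stand-in residual function that keeps the channel count (IRB case).
irb_out = identity_block(r0, lambda r: [2.0 * v for v in r])

# Stand-in residual function that changes 2 channels into 3 (CRB case).
expand = lambda r: [r[0], r[1], r[0] + r[1]]
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical 1 x 1 projection weights
b_s = [0.0, 0.0, 0.0]
crb_out = conv_block(r0, expand, w_s, b_s)
```

The identity block adds no parameters, whereas the convolutional block carries the extra projection weights `w_s` and `b_s`, which mirrors the trade-off described in the text.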
After the device fault features have been extracted by the 16 residual units, they are passed through global average pooling (GAP) before being connected to a fully connected layer. GAP averages all elements of the feature matrix of each channel to obtain a feature vector $g$ with one entry per channel. Compared with connecting the feature maps directly to the fully connected layer, this approach reduces the number of network parameters and helps prevent overfitting.
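Subsequently, the fully connected layer maps the features obtained from GAP to the sample’s label space, with the output $o$ denoted as:
$$o = W_{fc}\,g + b_{fc}$$
where $g$ is the feature vector produced by GAP and $\theta_{fc} = \{W_{fc}, b_{fc}\}$ is the set of trainable parameters for the fully connected layer.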
Finally, the Softmax function is used to calculate the probability distribution of the fully connected output $o$ in the label space. The probability of belonging to the $s$-th fault type is:
$$P(s \mid o) = \frac{\exp\left(\theta_s^{\mathsf{T}} o\right)}{\sum_{c=1}^{S} \exp\left(\theta_c^{\mathsf{T}} o\right)}$$
where $\theta = \{\theta_1, \ldots, \theta_S\}$ is the set of parameters to be trained in the Softmax layer, and $S$ is the total number of types. In this paper, $S = 7$. The category with the highest calculated probability is taken as the final classification result.
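The classification head described above (GAP, fully connected layer, Softmax, highest-probability decision) can be sketched as follows. The function names, the toy feature maps, and the weights are hypothetical, and three classes are used instead of seven to keep the example small. As a simplifying assumption, the Softmax is applied directly to the fully connected output without separate Softmax-layer parameters.

```python
import math

def global_average_pool(feature_maps):
    """Average each channel's feature map to one value (one entry per channel)."""
    return [sum(sum(row) for row in fmap) / sum(len(row) for row in fmap)
            for fmap in feature_maps]

def fully_connected(g, weights, bias):
    """Map the pooled feature vector g to one score per class."""
    return [sum(w * x for w, x in zip(row, g)) + b for row, b in zip(weights, bias)]

def softmax(o):
    """Numerically stable softmax over a score vector."""
    peak = max(o)
    exps = [math.exp(v - peak) for v in o]
    total = sum(exps)
    return [e / total for e in exps]

def classify(feature_maps, weights, bias):
    """Return (index of the most probable class, class probabilities)."""
    g = global_average_pool(feature_maps)
    probs = softmax(fully_connected(g, weights, bias))
    return probs.index(max(probs)), probs

# Two channels of 2 x 2 features and three hypothetical classes.
maps = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [2.0, 4.0]]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
label, probs = classify(maps, weights, bias)
```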
4.4. Bayesian Optimization
The Bayesian optimization algorithm is a method for black-box function optimization problems and is often used to tune the hyperparameters of complex learning models. The algorithm approximates the objective function with a surrogate model built by Gaussian process regression (GPR) [23], which is updated as new observations arrive, and selects the next parameter value through an acquisition strategy so as to optimize the objective function. Figure 6 is the flowchart of the Bayesian optimization algorithm.
In hyperparameter optimization, the Bayesian optimization algorithm selects the next candidate hyperparameter point with a model-based method. It thus avoids evaluating all possible hyperparameter combinations and finds a good hyperparameter point in relatively few iterations. These advantages make Bayesian optimization especially suitable for practical problems where evaluation is computationally expensive or the objective function is difficult to optimize. The pseudocode of the algorithm is shown in Algorithm 1.
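Algorithm 1 Bayesian optimization algorithm
Input: objective function $f(x)$; search space $\mathcal{X}$; initial observation set $D_0 = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X}$, $y_i = f(x_i)$; maximum number of iterations $T$; stopping condition $S$.
Output: optimal parameters $x^{*}$; optimal value $f(x^{*})$.
Steps:
Initialization: set the current dataset as $D = D_0$ and the current optimal value $y^{*}$ as the best $y_i$ in $D_0$.
Iterative optimization: for $t = 1, 2, \ldots, T$
- (1) Gaussian process modeling: assume $f \sim \mathcal{GP}(m(x), k(x, x'))$ and calculate the posterior distribution given $D$: mean $\mu_t(x)$, variance $\sigma_t^{2}(x)$.
- (2) Calculate the acquisition function: choose the expected improvement criterion $\mathrm{EI}_t(x) = \mathbb{E}\left[\max(0, I(x))\right]$, where $I(x)$ is the improvement of $f(x)$ over $y^{*}$ and the expectation is taken over the posterior.
- (3) Optimize the acquisition function: solve for the next query point, $x_t = \arg\max_{x \in \mathcal{X}} \mathrm{EI}_t(x)$.
- (4) Query the objective function: calculate $y_t = f(x_t)$ and update the dataset $D = D \cup \{(x_t, y_t)\}$ and the current optimal value $y^{*}$.
- (5) Check the stopping criterion: terminate the iteration if the stopping condition $S$ is satisfied.
Extract the optimal solution from the dataset $D$: $x^{*}$ is the point $x_i$ in $D$ with the best observed value $y_i$, and $f(x^{*})$ is that value.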
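The loop in Algorithm 1 can be sketched in pure Python as shown below. This is a minimal, hedged illustration rather than the paper's implementation. Several choices are assumptions made for the example: the objective is minimized, the surrogate is a zero-mean Gaussian process with a fixed squared-exponential kernel, the acquisition function is maximized over random candidates, the loop stops only after a fixed number of iterations, and `toy_loss` is a hypothetical stand-in for the validation loss as a function of the base-10 logarithm of the learning rate.

```python
import math
import random

def rbf(a, b, length=0.5):
    """Squared-exponential kernel for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / low[j][j]
    return low

def solve_lower(low, b):
    """Solve low @ z = b by forward substitution."""
    z = []
    for i, row in enumerate(low):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def solve_upper(low, z):
    """Solve low.T @ x = z by back substitution."""
    n = len(low)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def gp_fit(xs, ys, jitter=1e-6):
    """Factor the kernel matrix and precompute the weights for the posterior mean."""
    gram = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    low = cholesky(gram)
    return low, solve_upper(low, solve_lower(low, ys))

def gp_predict(xs, low, alpha, x_new):
    """Posterior mean and standard deviation of the zero-mean GP at x_new."""
    k_star = [rbf(a, x_new) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    var = max(rbf(x_new, x_new) - sum(e * e for e in v), 1e-12)
    return mean, math.sqrt(var)

def expected_improvement(mean, std, best):
    """EI for minimization: E[max(0, best - f(x))] with f(x) ~ N(mean, std^2)."""
    z = (best - mean) / std
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (best - mean) * cdf + std * pdf

def bayes_opt(f, lower, upper, n_init=3, n_iter=15, n_cand=200, seed=0):
    """Minimize f on [lower, upper]; returns (best_x, best_y, all_x, all_y)."""
    rng = random.Random(seed)
    xs = [rng.uniform(lower, upper) for _ in range(n_init)]
    ys = [f(x) for x in xs]
    for _ in range(n_iter):
        offset = sum(ys) / len(ys)               # center targets for the zero-mean GP
        centered = [y - offset for y in ys]
        low, alpha = gp_fit(xs, centered)
        best = min(centered)
        candidates = [rng.uniform(lower, upper) for _ in range(n_cand)]
        scores = [expected_improvement(*gp_predict(xs, low, alpha, c), best)
                  for c in candidates]
        x_next = candidates[scores.index(max(scores))]
        xs.append(x_next)
        ys.append(f(x_next))
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best], xs, ys

def toy_loss(u):
    """Hypothetical stand-in for the validation loss at learning rate 10**u."""
    return (u + 3.0) ** 2

best_u, best_loss, all_u, all_loss = bayes_opt(toy_loss, -5.0, -1.0)
best_lr = 10.0 ** best_u
```

Searching over the logarithm of the learning rate is a design choice of this sketch: it spreads the candidates evenly across orders of magnitude, which matches a search range spanning several decades.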
5. Experimental Results
5.1. Experiment Deployment
The experiment was carried out in the real working environment of a hybrid pumped storage power plant. The data acquisition system consisted of a Brüel & Kjær 4189 free-field microphone, an NI PXIe-4499 high-speed acquisition card, and an NVIDIA Jetson AGX Xavier edge computing platform. The data acquisition specifications are shown in Table 1.
The sensor array is arranged at six key points, including the bearing seat, the frame, and the top cover of the unit, to cover the radial and axial vibration-sensitive areas. The composition of the data acquisition system is shown in Figure 7.
The dataset is constructed at three levels according to the degree of fault development, as shown in Table 2.
Based on the fault severity grading criteria defined in Table 2, a library of acoustic pattern samples covering the full life cycle of the fault was constructed in this study. As shown in Table 3, six types of typical faults were collected at three development stages: L1 (early), L2 (middle), and L3 (late). For each fault type, the sample size at each of the L1–L3 levels is controlled to 85–86 groups, and the sample size of the normal state is 256 groups. These samples are divided into training, validation, and test sets in the ratio of 7:3:1, and K-fold cross-validation is adopted for model training.
5.2. Hydroturbine Generator Set Voiceprint Noise Reduction Processing
The voiceprint signals were acquired with the experimental equipment described in Section 5.1. The EEMD decomposition results for 20 samples are shown in Figure 8.
In Figure 8, the EEMD decomposition is shown on the left and the corresponding spectra on the right. EEMD decomposes the noisy signal into six IMF components and a trend component. The IMFs are arranged from high to low frequency, each IMF carries specific frequency characteristics, and recombining all the IMFs recovers the full spectrum of the signal before decomposition.
For all IMF components, the correlation coefficients with the original noisy signal were calculated; the results are shown in Table 4.
The IMF components with high correlation with the noisy signal are components 1, 3, and 4. The correlation coefficients of the remaining components 5 and 6 and of the trend component Res are very small, so these can be identified as low-frequency interference noise. The correlation coefficient of IMF2 is smaller than those of IMF components 1 and 4. Combined with the spectrum in Figure 2, IMF2 can be identified as high-frequency interference noise. IMF components 1, 3, and 4 are therefore retained and recombined to form the new denoised signal.
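The correlation-based screening step can be sketched as follows. This illustration covers only the thresholding on the correlation coefficient, not the additional spectrum-based judgment applied to IMF2 in the text. The function names, the threshold value, and the two synthetic components standing in for IMFs are hypothetical, and the indices in the code are zero-based, unlike the IMF numbering in the paper.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                     * sum((y - mean_b) ** 2 for y in b))
    return cov / norm if norm else 0.0

def select_imfs(signal, imfs, threshold):
    """Indices of components whose |correlation| with the signal reaches the threshold."""
    return [i for i, imf in enumerate(imfs)
            if abs(pearson(signal, imf)) >= threshold]

def reconstruct(imfs, indices):
    """Sum the retained components sample by sample."""
    return [sum(imfs[i][n] for i in indices) for n in range(len(imfs[0]))]

# Synthetic stand-ins for two IMFs: a dominant slow component and a weak fast one.
n = 200
slow = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
fast = [0.05 * math.sin(2 * math.pi * 60 * t / n) for t in range(n)]
signal = [a + b for a, b in zip(slow, fast)]

kept = select_imfs(signal, [slow, fast], threshold=0.3)   # hypothetical threshold
denoised = reconstruct([slow, fast], kept)
```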
5.3. Fault Diagnosis Results After Bayesian Optimization
In this paper, Bayesian optimization is used to tune a single hyperparameter, the initial learning rate, over 30 iterations in the range $(1 \times 10^{-5}, 0.1)$. Kullback–Leibler (KL) divergence is used as the loss function during model training and validation. The optimization process is plotted as curves with a logarithmic abscissa. The accuracy comparison under different initial learning rates is shown in Figure 9, and the loss value comparison under different initial learning rates is shown in Figure 10.
As shown in Figure 9, the accuracy does not change monotonically with the learning rate; instead, it peaks at a specific learning rate. During the Bayesian optimization process, when the initial learning rate is set to 0.00126, the validation set accuracy reaches its maximum of 96.08%, which indicates that an appropriate learning rate can significantly enhance the model’s fitting ability. Figure 10 shows that the loss value first decreases and then increases as the learning rate increases, with the minimum validation loss of 0.67 achieved at the optimal learning rate.
The ResNet + self-attention network model trained under this condition was exported and tested using the test set data. Figure 11 presents the confusion matrix derived from the test results. The confusion matrix shows that the model classifies faults very well: the test accuracy of the model trained with the optimized learning rate reaches 99.4%, with both precision and recall for all categories being ≥0.99, and the average F1-score reaching 0.997. Specifically, categories 1, 2, 4, and 7 achieve 100% precision and recall; only categories 3, 5, and 6 have 1–2 instances of misclassification into adjacent categories (e.g., category 5 → 6, category 6 → 7). Overall, the cross-confusion rate is very low, indicating that the model recognizes both single and mixed faults with high stability and accuracy, thus verifying the effectiveness of the Bayesian optimization algorithm.
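For reference, the metrics quoted above follow from a confusion matrix as sketched below. The function names and the small three-class matrix are hypothetical toy values, not the counts behind Figure 11; rows are assumed to be true classes and columns predicted classes.

```python
def per_class_metrics(cm):
    """Return (precision, recall, F1) per class; cm[i][j] counts true i predicted j."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))   # column sum
        actual = sum(cm[c])                           # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        metrics.append((precision, recall, f1))
    return metrics

def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

# Toy matrix: one sample of the second class is predicted as the third class.
cm = [[10, 0, 0],
      [0, 9, 1],
      [0, 0, 10]]
metrics = per_class_metrics(cm)
overall = accuracy(cm)
```

In this toy matrix the single error lowers the recall of the second class and the precision of the third class, which shows how one misclassification affects the metrics of two categories at once.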
5.4. Single Fault and Mixed Fault Recognition Results
Figure 12a shows the comparison between the actual and simulated normal signals, Figure 12b shows the comparison between the actual and simulated single faults (bearing pitting), Figure 12c shows the comparison between the actual and simulated mixed faults (bearing wear + blade cavitation), and Figure 12d is the comprehensive comparison. In the normal signal spectrum, both the actual and simulated signals are dominated by the fundamental frequency. The actual signal has slight spurs due to environmental noise, which confirms the stability of the fundamental frequency dominance in normal signals. In the single fault spectrum, the simulated signal clearly contains impact harmonics. Although the actual signal has spurs caused by noise, the core frequencies are consistent, which verifies the necessity of EEMD denoising for retaining fault features. In the mixed fault spectrum, the simulated signal shows the superposition of characteristic frequencies at 1.5 kHz, 8 kHz, and 12 kHz, and the core frequencies of the actual signal are consistent with it, reflecting the algorithm’s ability to analyze complex spectra. The comprehensive comparison chart shows significant spectral differences between the states, and the core frequencies of the actual and simulated signals are consistent, with only the actual signal affected by noise. This confirms that, through denoising, feature focusing, and Bayesian optimization, the algorithm retains high diagnostic accuracy in actual scenarios.
As can be seen from the data in Table 5, the model accuracy in the single fault test reaches 99.80%. This shows that when a hydro-generator unit experiences only a single type of failure, the algorithm identifies the fault type very accurately. In the mixed fault test, the accuracy of the model reaches 98.63%. Although this is slightly lower than for single fault recognition, it remains at a high level. In practice, therefore, when a hydroelectric generating set experiences several problems at the same time, the algorithm can still extract the key features from the complex voiceprint signal and accurately identify the mixed fault types, which demonstrates its stability and reliability in complex fault situations.
5.5. Test Results Under the Sound Environment with Added Noise
Table 6 shows the algorithm test results under added noise and environmental sound. In the normal state, both with and without the added noise signal, the algorithm accuracy reaches 100%, showing that the algorithm maintains its recognition ability in both cases without misjudgment and has a certain anti-interference ability. In the fault state, for all kinds of fault signals, namely bearing pitting, bearing wear, runner blade cavitation, rotor unbalance, cooling system anomaly, and mechanical looseness, the algorithm achieves 1500/1500 correct recognitions for both the plain fault signals and the fault signals with noise, so the accuracy rate is also 100%. These results show that the proposed algorithm can effectively extract fault features in an actual complex acoustic environment and accurately identify the various fault types despite interference from noise and environmental sound.
5.6. Comparison of Algorithm Performance
Table 7 presents a performance comparison between the algorithm proposed in this paper and LSTM-KLD [17], Modified-ARIMA [20], CNN [21], ResNet50, and LSTM under four different scenarios: fault identification without voiceprint denoising, fault identification after voiceprint denoising, datasets with added noise, and datasets containing mixed faults.
- (1) Fault identification without voiceprint denoising
The accuracy of the proposed algorithm reaches 88.12%, which is higher than that of the other five algorithms. Modified-ARIMA (85.44%) and ResNet50 (84.22%) come next but remain below the proposed algorithm. This indicates that, without denoising the voiceprint signals, the proposed algorithm extracts fault features from the original voiceprint signals more effectively and has stronger fault recognition capability. In terms of time consumption, the proposed algorithm takes 10.01 s, which is significantly faster than LSTM (25.55 s) and ResNet50 (22.15 s), demonstrating a clear advantage in computational efficiency.
- (2) Fault identification after voiceprint denoising
The accuracy of the proposed algorithm reaches 100.0%, which is significantly higher than CNN (96.67%) and Modified-ARIMA (95.15%). This shows that, after voiceprint denoising, the proposed algorithm captures fault features more accurately and identifies all faults correctly. In terms of time consumption, the proposed algorithm takes 4.02 s, which is much lower than ResNet50 (18.22 s) and LSTM (12.74 s), so it completes fault diagnosis quickly while maintaining high accuracy.
- (3) Dataset with added noise
The accuracy of the proposed algorithm is 98.24%, higher than that of CNN (94.54%) and Modified-ARIMA (92.12%). This indicates that, when noise interference is added to the dataset, the proposed algorithm has strong anti-interference ability and can still accurately identify faults in complex noise environments. In terms of time consumption, the proposed algorithm takes 6.03 s, significantly faster than ResNet50 (17.55 s) and LSTM (13.69 s), indicating that on noisy datasets it maintains high accuracy while completing fault diagnosis faster, showing good real-time performance.
- (4) Dataset containing mixed faults
The accuracy of the proposed algorithm is 98.22%, which is significantly higher than that of the other algorithms (the highest being CNN at 91.52%). This indicates that, when the dataset contains a mixture of multiple fault types, the proposed algorithm can effectively distinguish different fault features in complex voiceprint signals and accurately identify mixed faults. In terms of time consumption, the proposed algorithm takes 6.83 s, much faster than ResNet50 (19.52 s) and LSTM (18.11 s), indicating that it also computes efficiently when processing mixed faults and provides accurate diagnosis results in a shorter time.