4.1. Construction of Algorithm Model
To perform fault diagnosis of the motor bearing transmission system, a suitable algorithm model must be designed around the characteristics and quantity of the feature datasets. Because the extracted features are one-dimensional, the model should be tailored to such features; the choice of algorithm model is therefore crucial to effectively achieving the task at hand.
Motivated by the success of acoustic signals in transformer fault diagnosis, we propose a CNN–attention mechanism–LSTM network model to realize fault diagnosis of the motor bearing transmission system.
The overall framework of the algorithm model can be divided into four parts.
First, the model starts with a one-dimensional convolutional layer aimed at capturing local features in the input data. This is followed by a ReLU activation function, which introduces nonlinearity into the module. Subsequently, a maximum pooling layer reduces the feature dimensionality while retaining the essential features of the data. Notably, the convolutional layer in this module uses a kernel size of 3 and a stride of 1, while the maximum pooling layer employs a kernel size of 2 with a stride of 2.
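As a concrete illustration, the convolutional front end described above can be sketched in PyTorch as follows; the output channel count of 16 and the use of padding are illustrative assumptions not specified in the text:

```python
import torch
import torch.nn as nn

# Sketch of the 1D-CNN front end: Conv1d (kernel 3, stride 1) -> ReLU ->
# MaxPool1d (kernel 2, stride 2). Channel count (1 -> 16) and padding=1
# are illustrative assumptions.
cnn_front = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2, stride=2),
)

x = torch.randn(8, 1, 18)   # a batch of 8 feature vectors of length 18
y = cnn_front(x)
print(y.shape)              # pooling halves the length dimension
```

With `padding=1`, the convolution preserves the input length and the pooling layer halves it, so an 18-point feature vector becomes a 16-channel sequence of length 9.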
The attention mechanism module comes second in the model. Various types of attention mechanisms are available; for this study, we employ five specific attention mechanism modules, namely SE, CBAM, ECA, SA, and SpatialAttention [43]. These modules selectively enhance the features processed by the one-dimensional convolutional neural network module.
The third part consists of the long short-term memory (LSTM) module, configured with a single layer containing a hidden size of 128.
The fourth part involves a fully connected layer that converts the processed features into output categories.
The construction of the entire algorithm model begins with capturing local features in the input using one-dimensional convolution, followed by enhancing nonlinearity through the ReLU activation function. Subsequently, the feature dimension is reduced via maximum pooling to retain key features. The processed features are then fed into the attention mechanism module for selective enhancement. Next, the features are flattened, a new dimension is added, and the data are passed to the LSTM module. Finally, a fully connected layer converts the data directly into the output for each category.
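The end-to-end flow described above can be sketched as follows; the channel count, the padding, and the `nn.Identity()` stand-in for the attention module are illustrative assumptions, not the exact configuration used in this work:

```python
import torch
import torch.nn as nn

class FaultDiagnosisNet(nn.Module):
    """Sketch of the CNN -> attention -> LSTM -> FC pipeline described above.

    The channel count (16), padding, and the nn.Identity() placeholder
    (standing in for SE/CBAM/ECA/SA/SpatialAttention) are assumptions.
    """
    def __init__(self, n_features=18, n_classes=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
        )
        self.attention = nn.Identity()           # placeholder attention module
        self.lstm = nn.LSTM(input_size=16 * (n_features // 2),
                            hidden_size=128, num_layers=1, batch_first=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                        # x: (batch, 1, n_features)
        x = self.cnn(x)                          # local feature extraction
        x = self.attention(x)                    # selective enhancement
        x = torch.flatten(x, start_dim=1)        # flatten channels and length
        x = x.unsqueeze(1)                       # add a sequence dimension
        out, _ = self.lstm(x)                    # single-layer LSTM, hidden 128
        return self.fc(out[:, -1, :])            # class logits

model = FaultDiagnosisNet()
logits = model(torch.randn(8, 1, 18))
print(logits.shape)                              # one logit vector per sample
```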
Rationale for Hyperparameter Selection
Convolutional layer (kernel size = 3, stride = 1): The kernel size of 3 is a standard choice for 1D-CNNs processing sequential features. It is large enough to capture local patterns and interactions between adjacent feature points, yet small enough to maintain a high degree of model efficiency and avoid overfitting. A stride of 1 is employed to ensure dense, sliding-window feature extraction, preserving the maximum amount of information from the input sequence.
Maximum pooling layer (kernel size = 2, stride = 2): It effectively reduces the spatial dimensions (length) of the features by half, thereby decreasing computational complexity and providing a degree of translational invariance. It helps to retain the most salient features while controlling overfitting. We choose maximum pooling over average pooling because bearing fault features are often manifested as impulsive components in the signal/feature domain, and maximum pooling is more effective at preserving such high amplitude, salient activations.
LSTM hidden size (128): The hidden state dimension of 128 represents a balance between model capacity and computational efficiency. It provides sufficient representational power to learn complex temporal dynamics from the condensed features output by the preceding CNN and attention modules.
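A toy numeric example illustrates the rationale for maximum pooling given above: an impulsive spike, typical of bearing-fault features, survives max pooling intact but is diluted by average pooling.

```python
import numpy as np

# A mostly flat signal with one impulsive spike, mimicking a fault feature.
signal = np.array([0.1, 0.1, 5.0, 0.1, 0.1, 0.1], dtype=float)

windows = signal.reshape(-1, 2)     # non-overlapping windows: kernel 2, stride 2
max_pooled = windows.max(axis=1)    # spike preserved at full amplitude
avg_pooled = windows.mean(axis=1)   # spike averaged down with its neighbor

print(max_pooled)                   # [0.1, 5.0, 0.1]
print(avg_pooled)                   # [0.1, 2.55, 0.1]
```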
4.3. Preliminary Results and Analysis
After the preliminary work is completed, training can be carried out using the feature datasets obtained as the input for the constructed algorithm model. The training set, validation set, and test set are divided at a ratio of 8:1:1. The training parameters include a batch size of 128, 200 epochs, a fixed learning rate of 0.001, utilization of the Adam optimizer, and the cross-entropy function as the loss function. Training is conducted using GPU acceleration.
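The training setup described above can be sketched as follows; the tiny linear model and the random data are placeholders for the real network and feature datasets, and the epoch count is truncated for brevity:

```python
import torch
import torch.nn as nn

# Training configuration from the text: batch size 128, lr 0.001, Adam,
# cross-entropy loss. The nn.Linear model and random tensors are stand-ins.
model = nn.Linear(18, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

features = torch.randn(128, 18)        # one batch of 128 samples
labels = torch.randint(0, 4, (128,))   # four fault classes

for epoch in range(5):                 # the paper trains for 200 epochs
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()

print(float(loss))
```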
First, the time domain and frequency domain feature dataset is used as input for training. The results are shown in Table 7.
Second, the acoustic feature dataset is used as input for training. The results are shown in Table 8.
Finally, the complete dataset consisting of time domain and frequency domain features and acoustic features is used as input for training. The results are shown in Table 9.
Upon analysis of the results, it is evident that the fault diagnosis task is successfully completed, with a good classification effect achieved by utilizing acoustic features. Nonetheless, the accuracy obtained from employing only time domain and frequency domain features, or the combined feature dataset, is significantly lower and falls short of the desired outcome. Subsequent experimentation with these feature groups reveals persistently unstable accuracy levels ranging from 25% to 49%, rendering classification unattainable. Hence, it can be inferred that selecting acoustic features can achieve the classification task.
Next, we explore whether fault diagnosis can be achieved with fewer features. To investigate, we select and verify the most discriminative features.
4.4. Feature Selection
Although the feature datasets are rich, different features contribute to classification to varying degrees. Therefore, we conducted feature selection, which identifies and retains the most discriminative features, thereby achieving efficient and robust diagnosis.
Feature selection is a critical process in machine learning and data analysis, since it involves choosing a subset of features from the original data that contribute to the model’s predictive performance [4,44]. ReliefF is a feature selection method that quantifies this “discriminative ability” by calculating a weight score for each feature: the higher the weight, the greater the contribution and importance of the feature to classification. The ReliefF algorithm was chosen for its effectiveness in handling multi-class problems and its ability to evaluate features based on how well they distinguish between instances of different classes that are close to each other [45].
In this paper, we utilized the ReliefF feature selection algorithm [46] to select 18 features and created new datasets for analysis. The feature selection process was carried out twice. Initially, the top nine features were chosen from the time domain and frequency domain feature dataset and combined with the top nine features from the acoustic feature dataset, resulting in a 9 + 9 integrated dataset. Subsequently, the top 18 features from the acoustic feature set were selected to form the selected acoustic feature dataset. Given the large number of features involved, we show the importance scores and rankings of the top 9 time domain and frequency domain features, as well as the top 18 acoustic features. To mitigate the impact of randomness, each set of experiments was repeated four times, and the features that appeared most frequently were selected for analysis. Considering the length of this paper, only one specific selection result is presented here.
Table 10 and Table 11 present the results from one experiment conducted. The final features selected for the two new datasets are detailed in Table 12 and Table 13, while the specific descriptions of the two new datasets are provided in Table 14 and Table 15.
4.5. Results and Analysis After Feature Selection
The feature datasets obtained after feature selection are used as input for training; all other parameters remain unchanged as above. The results are shown in Table 16 and Table 17.
From the above two tables, it is evident that classification cannot be achieved when using the combined selected feature set that merges the two feature types. Conversely, when utilizing the selected acoustic feature set, the classification performance is very good. Therefore, it can be concluded that the acoustic features chosen by the feature selection algorithm are effective in achieving classification.
To shed light on the advantages of selecting acoustic features for fault diagnosis, two indicators (parameter quantity and estimated total size) are introduced as reference points for comparison. The results of this analysis are presented in Table 18. This approach allows us to examine the efficiency and effectiveness of selecting acoustic features both before and after performing fault diagnosis tasks.
It is evident from Table 18 that the ReliefF feature selection algorithm significantly reduces the number of parameters and the estimated total size while leaving the accuracy almost unchanged. This indicates the effectiveness of feature selection on acoustic features for efficient fault diagnosis of motor bearing transmission systems; consequently, fewer acoustic features can be utilized to achieve this task. After a comprehensive comparison of the five models above, the CNN-ECA-LSTM algorithm model was selected as the final structure for this task. The ECA attention mechanism module consists of an adaptive average pooling layer (AdaptiveAvgPool1d), a one-dimensional convolutional layer (Conv1d), and a sigmoid activation function; it enhances the features of important channels and suppresses those of unimportant channels in a lightweight manner. The accuracy curve and loss function curve of its training are shown in Figure 9.
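A minimal sketch of such an ECA module, assuming a kernel size of 3 (the kernel size is not specified in the text):

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of the ECA attention module described above: adaptive average
    pooling, a 1D convolution across the channel axis, and a sigmoid gate.
    The kernel size of 3 is an illustrative assumption."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (batch, channels, length)
        s = self.pool(x)                        # (B, C, 1) channel summary
        s = self.conv(s.transpose(1, 2))        # convolve across channels
        gate = self.sigmoid(s.transpose(1, 2))  # (B, C, 1) channel weights
        return x * gate                         # rescale each channel

x = torch.randn(8, 16, 9)
y = ECA()(x)
print(y.shape)                                  # input shape is preserved
```

Because the gate only rescales channels, the module adds very few parameters (a single 1D convolution kernel), which is what makes ECA lightweight.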
The confusion matrix is shown in Figure 10. The categories are 0 for the normal intact bearing (nor), 1 for the bearing with an outer ring defect (out), 2 for the bearing with an inner ring defect (in), and 3 for the bearing with mixed inner and outer ring defects (all2).
To ensure experimental reliability, we used the ROC curve and the corresponding AUC value as references. In a multi-classification task, we examined the ROC curve and AUC value for each class. These results are displayed in Figure 11.
The dashed line represents the performance benchmark of a random classifier; a model whose ROC curve lies above this line possesses effective discriminatory power. The AUC value of each class is 1, meaning that the model can perfectly distinguish between positive and negative examples, i.e., its performance is optimal. The precision, recall, and F1 score are shown in Table 19.
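The per-class (one-vs-rest) ROC/AUC evaluation can be sketched with scikit-learn as follows; the toy scores are deliberately perfect, so each per-class AUC equals 1.0, mirroring the result reported above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy one-vs-rest AUC for a 4-class task. Each row of y_score is an ideal
# softmax output that puts all probability on the true class.
y_true = np.array([0, 1, 2, 3, 0, 1, 2, 3])
y_score = np.eye(4)[y_true]

for c in range(4):
    auc = roc_auc_score((y_true == c).astype(int), y_score[:, c])
    print(f"class {c}: AUC = {auc:.2f}")
```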
Finally, we repeated the experiments on the selected algorithm model several times, and the results are shown in Table 20.
The average of the seven results is 99.90% and the average training time is 36.64 s.
From a signal processing perspective, our entire pipeline involves only signal cropping, with no additional processing steps, yet it achieves the recognition task with an average accuracy of 99.90%. Additionally, the AUC value of each class’s ROC curve is 1, demonstrating the effectiveness of our method. The approach thus realizes the classification of bearing fault types and supports research on bearing fault diagnosis methods based on sound signals and one-dimensional acoustic features.