Intelligent Diagnosis of Rolling Bearings Fault Based on Multisignal Fusion and MTF-ResNet

Existing diagnosis methods for bearing faults often neglect the temporal correlation of signals, resulting in easy loss of crucial information. Moreover, these methods struggle to adapt to complex working conditions for bearing fault feature extraction. To address these issues, this paper proposes an intelligent diagnosis method for compound faults in metro traction motor bearings. This method combines multisignal fusion, Markov transition field (MTF), and an optimized deep residual network (ResNet) to enhance the accuracy and effectiveness of diagnosis in the presence of complex working conditions. At the outset, the acquired vibration and acoustic emission signals are encoded into two-dimensional color feature images with temporal relevance by Markov transition field. Subsequently, the image features are extracted and fused into a set of comprehensive feature images with the aid of the image fusion framework based on a convolutional neural network (IFCNN). Afterwards, samples representing different fault types are presented as inputs to the optimized ResNet model during the training phase. Through this process, the model’s ability to achieve intelligent diagnosis of compound faults in variable working conditions is realized. The results of the experimental analysis verify that the proposed method can effectively extract comprehensive fault features while working in complex conditions, enhancing the efficiency of the detection process and achieving a high accuracy rate for the diagnosis of compound faults.


Introduction
The traction motor is the power source of metro trains, and the quality of its bearings directly affects normal motor operation. The frequent starting and stopping of the metro causes alternating changes in the speed of the traction motor bearings and in the loads they carry. Under long-term harsh working conditions, the inner rings, outer rings and rolling elements of the bearings develop varying degrees of pitting, cracking and more complex forms of failure. The adverse vibrations generated by a faulty bearing, when transmitted to the entire system over an extended period, not only damage the traction motor but also endanger other structural components, which seriously threatens the safety and reliability of metro trains. Intelligent diagnosis of bearing faults under complex working conditions enables timely identification of fault types and early maintenance intervention, and therefore has significant engineering value for practical applications.
Conventional approaches to bearing fault diagnosis rely predominantly on signal processing techniques. To address noise interference during feature extraction, wavelet thresholding was employed to remove significant noise components from the raw data [1,2]. To enhance the signal-to-noise ratio, refs. [3,4] adopted empirical mode decomposition (EMD) to decompose the signal into multiple intrinsic mode functions. Furthermore, ref. [5] combined optimized variational mode decomposition with transfer learning, using a ResNet model pretrained on ImageNet as the fault feature extractor, and reported highly accurate results. These studies have demonstrated promising outcomes in bearing fault diagnosis. However, certain limitations persist, including reliance on a single sensor signal and the absence of experimental verification on a purpose-built platform.
In summary, most existing studies rely on open source datasets with simple working conditions and failure forms, whereas the actual working conditions of bearings are complex and failures can occur at different locations and to different degrees. Compound bearing fault diagnosis under complex working conditions faces several challenges: the low reliability of single sensor signals, the tendency of traditional data processing methods to lose important information, the degradation of diagnostic models as network depth increases, and the difficulty of feature extraction. To address these challenges, this paper proposes an intelligent diagnosis method for compound bearing faults in metro traction motors that encodes acoustic emission and vibration signals with MTF, fuses their features with IFCNN, and classifies them with an optimized ResNet. The main contributions of the paper are as follows:
1. The application of IFCNN in compound bearing fault diagnosis allows for the fusion of multiple signal features, reducing the limitations of single sensor signals and providing more reliable diagnostic results.
2. The optimized ResNet model improves the efficiency of feature extraction by addressing the vanishing gradient problem. Combined with the MTF data processing method, it can effectively extract complex bearing fault features under varying working conditions with good accuracy and stability.
3. A test platform for metro traction motor bearings was constructed, and intelligent diagnosis of compound faults under variable working conditions was conducted, validating the effectiveness of the proposed method.
The remaining sections of this paper are arranged as follows: In Section 2, the data processing method used in this study and the construction of the dataset are introduced. Section 3 focuses on the multisignal fusion technology used in this study. Section 4 provides a detailed description of the fault diagnosis model and the corresponding diagnostic process. Section 5 explains the specific experimental design, as well as the diagnostic scheme adopted in this study. Section 6 analyzes the experimental results and carries out a series of method comparisons to validate the effectiveness of the proposed approach. Section 7 summarizes the main content of the paper and draws conclusions.

Data Preprocessing
In this study, a signal acquisition system was built to obtain a large amount of raw data using acoustic emission sensors, vibration sensors and PCI acquisition cards. The research focuses on compound faults, with pitting as the main defect and the defect location as the classification criterion. A total of eight classes, including the normal bearing, were designed and labeled for subsequent study. The fault types and labels are shown in Table 1.

Dataset Construction
The vibration and acoustic emission signals were acquired using a PCI data acquisition card with a sampling frequency of 50 kS/s and a sampling time of 10 s, giving a total of $5 \times 10^5$ sampling points. In this experiment, the minimum speed of the bearing is 800 rpm. At this speed, one revolution of the bearing yields 3750 sampling points. To ensure the completeness of the sampled fault information, the number of sampling points per sample should be at least twice this value, so a sample length of 8192 ($2^{13}$) was chosen. Because the amount of data is limited, the vibration and acoustic emission signals were augmented using overlapping sampling so that each fault type under each working condition contained 1000 samples, for a total of 8000 samples, which were randomly divided into a training set and a testing set at a ratio of 9:1. Under fixed working conditions, the dataset is divided as shown in Table 2.
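The sample-length arithmetic and the overlapping-window augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the window stride is not stated in the paper, so the value computed here (the largest stride that still yields 1000 windows from one 10 s record) is an assumption, as are all function names.

```python
import math

FS_HZ = 50_000          # sampling frequency, 50 kS/s
DURATION_S = 10         # sampling time per record
MIN_SPEED_RPM = 800     # lowest bearing speed in the experiment
SAMPLES_PER_CLASS = 1000


def points_per_revolution(fs_hz, speed_rpm):
    """Number of sampling points acquired during one shaft revolution."""
    return fs_hz * 60 / speed_rpm


def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()


def window_starts(total_points, length, stride):
    """Start indices of overlapping windows that fit entirely in the record."""
    return range(0, total_points - length + 1, stride)


def overlapping_windows(signal, length, stride):
    """Cut a 1-D signal into overlapping windows of equal length."""
    return [signal[s:s + length] for s in window_starts(len(signal), length, stride)]


per_rev = points_per_revolution(FS_HZ, MIN_SPEED_RPM)      # 3750 points
sample_len = next_power_of_two(math.ceil(2 * per_rev))     # at least two revolutions
total_points = FS_HZ * DURATION_S                          # 5 * 10**5 points
# Assumed stride: largest step that still produces SAMPLES_PER_CLASS windows.
stride = (total_points - sample_len) // (SAMPLES_PER_CLASS - 1)
n_windows = len(window_starts(total_points, sample_len, stride))
print(per_rev, sample_len, stride, n_windows)
```

With these numbers, consecutive windows overlap heavily, which is what allows 1000 samples of 8192 points to be drawn from a record of $5 \times 10^5$ points.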

MTF Image Encoding
In this paper, MTF is used to process the vibration and acoustic emission signal data, converting the acquired data samples into image samples. MTF is an image encoding method that converts original vibration or acoustic emission time series into two-dimensional images through Markov transition probabilities [25].
Suppose a discretized time series $X = \{x_1, x_2, \cdots, x_n\}$ is partitioned into $Q$ quantile intervals over its value domain. Each $x_t$ in the sequence can then be mapped to its corresponding interval $q_k$ ($k \in [1, Q]$). Calculating the state transition probabilities according to the Markov chain principle yields a state transition probability matrix $W$ of size $Q \times Q$, as shown in Equation (1):

$$W = \begin{bmatrix} w_{11} & \cdots & w_{1Q} \\ \vdots & \ddots & \vdots \\ w_{Q1} & \cdots & w_{QQ} \end{bmatrix} \tag{1}$$

where $w_{ij}$ denotes the probability that a sample point in interval $q_j$ at moment $t$ is transferred to interval $q_i$ at moment $t + 1$ [26].
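By incorporating the temporal information into the state transition probability matrix $W$ and arranging each state transition probability $w_{ij}$ in time sequence, a Markov transition field (MTF) matrix $M$ of size $n \times n$ is obtained, as shown in Equation (2):

$$M = \begin{bmatrix} w_{ij \mid x_1 \in q_i,\, x_1 \in q_j} & \cdots & w_{ij \mid x_1 \in q_i,\, x_n \in q_j} \\ \vdots & \ddots & \vdots \\ w_{ij \mid x_n \in q_i,\, x_1 \in q_j} & \cdots & w_{ij \mid x_n \in q_i,\, x_n \in q_j} \end{bmatrix} \tag{2}$$

where $m_{ij}$ denotes the transition probability $w_{ij}$ between the intervals ($q_j \rightarrow q_i$) in which the corresponding sample points are located, arranged in time sequence.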
The elements $m_{ij}$ of the MTF matrix are used as pixel values to form a two-dimensional feature image with temporal correlation. Because the number of sample points directly determines the size of the encoded image, an overly large image is unsuitable as direct input to the CNN. To improve computational efficiency, a blurring kernel $\{1/m^2\}_{m \times m}$ is used to average the pixels in each non-overlapping $m \times m$ region. Figure 1 shows images of different fault types obtained by MTF encoding each sample of 8192 sampling points and then applying pixel averaging. Compared to traditional time domain analysis methods, MTF-encoded images preserve time-related information and enable clearer differentiation of the various fault types in rolling bearings.
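The encoding steps above (quantile binning, transition matrix, MTF and block averaging) can be illustrated with a small pure-Python sketch. It follows the convention stated in the text, where $w_{ij}$ is the probability of moving from interval $q_j$ to interval $q_i$. The function names and the toy input are hypothetical, and the way the paper arrives at its final 224 × 224 image size is not specified and is not reproduced here.

```python
from bisect import bisect_right


def quantile_bins(x, n_bins):
    """Assign each value to one of n_bins quantile intervals (0 .. n_bins-1)."""
    sorted_x = sorted(x)
    n = len(x)
    edges = [sorted_x[(k * n) // n_bins] for k in range(1, n_bins)]
    return [bisect_right(edges, v) for v in x]


def transition_matrix(states, n_bins):
    """w[i][j] = P(state i at t+1 | state j at t); each column sums to 1."""
    w = [[0.0] * n_bins for _ in range(n_bins)]
    for prev, nxt in zip(states, states[1:]):
        w[nxt][prev] += 1.0
    for j in range(n_bins):
        col_sum = sum(w[i][j] for i in range(n_bins))
        if col_sum > 0:
            for i in range(n_bins):
                w[i][j] /= col_sum
    return w


def markov_transition_field(x, n_bins):
    """n x n field whose entry (i, j) is w[state of x_i][state of x_j]."""
    states = quantile_bins(x, n_bins)
    w = transition_matrix(states, n_bins)
    return [[w[si][sj] for sj in states] for si in states]


def block_average(field, m):
    """Average non-overlapping m x m blocks (the 1/m^2 blurring kernel)."""
    n = len(field) // m
    return [
        [
            sum(field[r * m + a][c * m + b] for a in range(m) for b in range(m)) / (m * m)
            for c in range(n)
        ]
        for r in range(n)
    ]


toy_signal = [1, 2, 3, 4]
field = markov_transition_field(toy_signal, n_bins=2)
image = block_average(field, m=2)
print(image)
```

For a real sample of 8192 points the field has 8192 × 8192 entries, which is why the averaging step (or an array library) is needed before the image is passed to a CNN.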

Multisignal Fusion
To enhance system stability and increase diagnostic reliability, this article collected vibration signals and acoustic emission signals and fused them for processing. This fusion processing can establish correlations between multiple signal sources. Usually, information fusion can be divided into three levels: data-level fusion, feature-level fusion, and decision-level fusion. Considering that the sample data in this study consist of MTF encoded images of different fault types, it is advantageous to employ CNN for image processing. Therefore, this paper adopted the IFCNN for feature-level fusion of the data.
IFCNN consists of three modules, namely, the feature extraction module, the feature fusion module and the feature reconstruction module [27], and the structure of this framework is shown in Figure 2.

The feature extraction module consists of two convolutional layers. The first is the first convolutional layer of the ResNet101 network model pretrained on the ImageNet dataset. It includes 64 convolutional kernels of size 7 × 7 and retains the pretrained parameters, enabling effective extraction of image features. The second convolutional layer includes 64 convolutional kernels of size 3 × 3, which adjust the features extracted by the first layer to suit feature fusion. In this study, the feature fusion module adopts an element-wise maximum fusion strategy. The final module is the image reconstruction module, in which the third convolutional layer includes 64 convolutional kernels of size 3 × 3. This layer adjusts the fused convolutional features and plays an important role in reconstructing the image. The fourth convolutional layer includes 3 convolutional kernels of size 1 × 1 and reconstructs the feature map into a three-channel output.
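The element-wise maximum fusion strategy mentioned above can be illustrated with a minimal sketch. It assumes that the feature maps of the two signals have identical shape (channels × height × width) and are stored as nested lists; the function name and toy values are hypothetical and do not come from the IFCNN implementation.

```python
def elementwise_max_fusion(feat_a, feat_b):
    """Fuse two feature tensors [channel][row][col] by taking the larger value
    at every position, so the stronger response of either signal is kept."""
    return [
        [[max(a, b) for a, b in zip(row_a, row_b)] for row_a, row_b in zip(chan_a, chan_b)]
        for chan_a, chan_b in zip(feat_a, feat_b)
    ]


# Toy single-channel 2 x 2 feature maps standing in for the vibration and
# acoustic emission branches.
vibration_features = [[[0.2, 0.9], [0.1, 0.4]]]
acoustic_features = [[[0.5, 0.3], [0.1, 0.8]]]
fused = elementwise_max_fusion(vibration_features, acoustic_features)
print(fused)
```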
This framework uses the mean squared error (MSE) as the basic loss function and adds a perceptual loss to optimize the model. The expression for the perceptual loss ($P_{loss}$) is as follows:

$$P_{loss} = \frac{1}{C_f H_f W_f} \sum_{i} \left\| f_p^i - f_g^i \right\|_2^2$$

where $f_p$ and $f_g$ are the feature maps of the predicted fused image and the true fused image, respectively; $i$ is the feature map channel index; $C_f$, $H_f$ and $W_f$ are the number of channels, height and width of the feature map, respectively. The expression for the basic loss ($B_{loss}$) is as follows:

$$B_{loss} = \frac{1}{3 H_g W_g} \sum_{i} \left\| I_p^i - I_g^i \right\|_2^2$$

where $I_p$ and $I_g$ are the predicted fused image and the true fused image, respectively; $i$ is the RGB image channel index; $H_g$ and $W_g$ are the height and width of the true fused image, respectively. The expression for the total loss ($T_{loss}$) is as follows:

$$T_{loss} = w_1 B_{loss} + w_2 P_{loss}$$

where $w_1$ and $w_2$ are the weighting coefficients. For the fusion of MTF-encoded images in this study, $w_1$ and $w_2$ are both set to 1.
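As a rough illustration of how the basic, perceptual and total losses combine, the sketch below treats both terms as a squared difference averaged over all elements, which corresponds to normalizing by the number of channels times height times width. The function names and toy tensors are hypothetical, and the network that would produce the feature maps is not modeled.

```python
def mean_squared_difference(a, b):
    """Squared difference averaged over every element of [channel][row][col]."""
    total, count = 0.0, 0
    for chan_a, chan_b in zip(a, b):
        for row_a, row_b in zip(chan_a, chan_b):
            for va, vb in zip(row_a, row_b):
                total += (va - vb) ** 2
                count += 1
    return total / count


def total_loss(img_pred, img_true, feat_pred, feat_true, w1=1.0, w2=1.0):
    """Weighted sum of the basic (image-level) and perceptual (feature-level) terms."""
    b_loss = mean_squared_difference(img_pred, img_true)
    p_loss = mean_squared_difference(feat_pred, feat_true)
    return w1 * b_loss + w2 * p_loss


# Toy tensors: a 1 x 1 x 2 "image" and a 1 x 1 x 1 "feature map".
loss = total_loss(
    img_pred=[[[1.0, 2.0]]], img_true=[[[0.0, 0.0]]],
    feat_pred=[[[1.0]]], feat_true=[[[3.0]]],
)
print(loss)
```

Both weights default to 1 here, matching the setting reported for the MTF-encoded images.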

Optimized Deep Residual Network
ResNet is built on the basis of CNN and solves the gradient vanishing problem by adding skip connections between the input and output of each convolutional layer. The classic residual module structure is shown in Figure 3.

The structure contains two mappings: the main path is called the residual mapping, and the bypass connection is called the identity mapping. The final output of the residual block is therefore the superposition of the outputs of the two mappings:

$$H(x) = F(x) + x$$

where $x$ is the input of the residual block, $F(x)$ is the residual mapping and $H(x)$ is the output of the block.
The structure of the residual network model constructed in this study is shown in Table 3. It includes an input layer, a maximum pooling layer, convolutional layers, an average pooling layer, a fully connected layer and a softmax classifier. Conv2, Conv3, Conv4 and Conv5 are residual modules.
Table 3. ResNet model structure.

Columns: Layer Name, Kernel Size, Channel, Stride, Padding, Output.
Convolutional layers are the core of CNNs, responsible for extracting features from large amounts of input data. Typically, a convolutional layer can be described by the following expression:

$$x_j^l = \sigma\left( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l \right)$$

where $x_i^{l-1}$ is the $i$-th output feature map of the $(l-1)$-th layer, which serves as the input to the $l$-th layer; $x_j^l$ is the $j$-th output feature map of the $l$-th layer; $k_{ij}^l$ is the weight matrix of the convolutional kernel; $b_j^l$ is the bias term; $M_j$ is the set of input feature maps; $\sigma$ is the nonlinear activation function; and $*$ represents the convolution operation.
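The convolutional layer expression can be made concrete with a small sketch that computes one output feature map from a set of input maps. It assumes ReLU as the activation $\sigma$, no padding and a stride of 1, and it uses cross-correlation (no kernel flip), as deep learning frameworks conventionally do. All names and values are illustrative.

```python
def conv2d_valid(x, k):
    """Slide kernel k over a single 2-D map x without padding (stride 1)."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [sum(x[r + a][c + b] * k[a][b] for a in range(kh) for b in range(kw)) for c in range(out_w)]
        for r in range(out_h)
    ]


def relu(v):
    """Assumed activation function sigma."""
    return max(0.0, v)


def conv_layer_output(inputs, kernels, bias):
    """One output map j: sigma(sum over input maps i of x_i * k_ij + b_j)."""
    maps = [conv2d_valid(x, k) for x, k in zip(inputs, kernels)]
    height, width = len(maps[0]), len(maps[0][0])
    return [
        [relu(sum(m[r][c] for m in maps) + bias) for c in range(width)]
        for r in range(height)
    ]


inputs = [[[1.0, 2.0], [3.0, 4.0]]]      # one 2 x 2 input feature map
kernels = [[[1.0, 0.0], [0.0, 1.0]]]     # one 2 x 2 kernel for that map
print(conv_layer_output(inputs, kernels, bias=1.0))
```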
Pooling aims to reduce the size of feature maps while retaining the most important feature information. It can effectively reduce computational complexity and improve the model's robustness and generalization capabilities. The pooling process involves four steps: input feature map, sliding window coverage, feature aggregation, and output feature map. The pooling process can be described by the following expression:

$$x_j^l = \sigma\left( \beta_j^l \, \mathrm{down}\left( x_j^{l-1} \right) + b_j^l \right)$$

where $x_j^{l-1}$ is the $j$-th output feature map of the $(l-1)$-th layer, which serves as the input to the $l$-th layer; $x_j^l$ is the output of the $l$-th layer; $b_j^l$ is the bias term; $\sigma$ is the nonlinear activation function; $\mathrm{down}(\cdot)$ is the down-sampling function; and $\beta_j^l$ is the weight.
To improve the efficiency of fault diagnosis, a convolutional block attention module (CBAM) is introduced to optimize the model by focusing it more on important features [28]. CBAM consists of a channel attention module, which captures the connections between channels of the feature map, and a spatial attention module, which captures the connections between spatial regions of the feature map.
The channel attention module feeds the features $F^c_{avg}$ and $F^c_{max}$, obtained by average pooling and max pooling each channel of the feature map $F$ over its spatial dimensions, into the convolutional network, then sums the two results and outputs them. The process is described as:

$$M_c(F) = \sigma\left( W_1\left( W_0\left( F^c_{avg} \right) \right) + W_1\left( W_0\left( F^c_{max} \right) \right) \right)$$

where $\sigma$ is the sigmoid function; $W_0$ and $W_1$ are convolution operations with a kernel size of 1 × 1. The spatial attention module concatenates the features $F^s_{avg}$ and $F^s_{max}$, obtained by average pooling and max pooling along the channel dimension, and performs a convolution operation on the result. The process is described as:

$$M_s(F) = \sigma\left( f^{7 \times 7}\left( \left[ F^s_{avg}; F^s_{max} \right] \right) \right)$$

where $\sigma$ is the sigmoid function; $f^{7 \times 7}$ is a convolution operation with a kernel size of 7 × 7.
This study introduced CBAM into ResNet without changing the overall structure of the network. The input data are MTF feature images of size 224 × 224. After the first convolutional layer, with a kernel size of 7 × 7 and a stride of 2, the image size is reduced to 112 × 112. A max pooling layer with a stride of 2 then further reduces the image size to 56 × 56. The channel attention and spatial attention modules are added sequentially after the batch normalization (BN) layer at the end of each of the residual modules Conv2, Conv3, Conv4 and Conv5. Conv2 has 64 channels and convolutional kernels of size 3 × 3 with a stride of 1, so it extracts deeper features while keeping the image size of the previous layer. The channels in Conv3, Conv4 and Conv5 are doubled successively to 128, 256 and 512. In each of these residual modules, down-sampling is implemented by a stride of 2 in the first convolutional layer, so the output image sizes decrease progressively to 28 × 28, 14 × 14 and 7 × 7, respectively. The network then passes through an average pooling layer to reduce the number of parameters and mitigate overfitting. Finally, a fully connected layer is used for nonlinear combination of the extracted features, followed by a softmax classifier that produces the final output.
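The spatial sizes quoted above can be checked with the usual output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k)/s \rfloor + 1$. Kernel sizes and strides for the first convolution and the residual modules are taken from the text; the padding values (3 for the 7 × 7 convolution, 1 for the 3 × 3 layers) and the 3 × 3 max pooling kernel are assumptions based on the standard ResNet layout, since the body of Table 3 is not available here.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding); padding values and the pooling kernel are assumed.
STAGES = [
    ("conv1", 7, 2, 3),
    ("maxpool", 3, 2, 1),
    ("Conv2", 3, 1, 1),
    ("Conv3", 3, 2, 1),
    ("Conv4", 3, 2, 1),
    ("Conv5", 3, 2, 1),
]


def trace_sizes(input_size, stages):
    """Return the spatial size after every stage, in order."""
    sizes = []
    size = input_size
    for _name, kernel, stride, padding in stages:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


sizes = trace_sizes(224, STAGES)
for (name, *_), size in zip(STAGES, sizes):
    print(f"{name}: {size} x {size}")
```

Only the first, strided convolution of each residual module is traced, because the remaining 3 × 3 layers with stride 1 and padding 1 leave the size unchanged.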
The proposed model uses a cross-entropy loss function to evaluate the error between the predicted and true values, which helps avoid gradient dispersion. For a multiclass classification problem it is defined as:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log\left( p_{ic} \right)$$

where $N$ is the number of samples; $M$ is the number of categories; $y_{ic}$ is the indicator function, taking 1 if the true category of sample $i$ is $c$ and 0 otherwise; and $p_{ic}$ is the predicted probability that sample $i$ belongs to category $c$.
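A minimal numeric sketch of the softmax classifier output and the multiclass cross-entropy loss is given below. It assumes the loss is averaged over the samples in a batch, and because the indicator $y_{ic}$ is 1 only for the true category, the inner sum reduces to the log-probability of the true class. The function names and toy values are illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities; the max is subtracted for stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, labels):
    """Mean of -log(p_ic) over samples, where c is the true category of sample i."""
    return -sum(math.log(p[c]) for p, c in zip(probs, labels)) / len(labels)


probs = [[0.5, 0.5], [0.25, 0.75]]   # predicted probabilities for two samples
labels = [0, 1]                      # true categories
loss = cross_entropy(probs, labels)
print(softmax([0.0, 0.0]), loss)
```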
An initial test was carried out at a constant speed of 1600 rpm and a load of 7 kN with the number of epochs set to 50; the training loss and accuracy (Acc) are shown in Figure 4. The curves show that by epoch 40 the loss and accuracy have essentially converged and the accuracy is close to 100%. This indicates that the model performs well on the training set and has good generalization ability, which also supports the model structure and parameters chosen in this paper. Setting the number of epochs too large significantly prolongs training and can cause overfitting, while setting it too small may prevent the model from reaching the optimal solution. After multiple tests, the learning rate was set to 0.001 and the number of epochs to 40.
To intuitively demonstrate the advantages of the proposed method in extracting fault features, the uniform manifold approximation and projection (UMAP) algorithm was used to reduce the dimensionality of the data and visualize the results. Taking the steady state condition with a speed of 1600 rpm and a load of 7 kN as an example, a layer-by-layer analysis was conducted on ResNet models with and without CBAM, and the output features of the intermediate layers were extracted. UMAP was then used to reduce the extracted features to two dimensions. The fault features from the avgpool layer are visualized as a scatter plot in which different fault types are marked with different colors. The visualization is shown in Figure 5.
As can be seen from Figure 5, the degree of clustering of the data samples differs significantly between the two models, and introducing CBAM into ResNet yields a more distinct clustering effect in the avgpool layer. It can therefore be concluded that the proposed optimized ResNet has excellent abilities in extracting fault features under complex working conditions.

Fault Diagnosis Process
This paper proposes a compound fault diagnosis method of rolling bearings based on multisignal fusion and MTF-ResNet. The fused MTF-encoded images are input into the ResNet model for training, and the fault is intelligently diagnosed under different working conditions. The basic process is shown in Figure 6, and the main steps are as follows: (1) acquire vibration and acoustic emission signals; (2) generate feature images of size 224 × 224 by MTF encoding of the original data to build a training set and a test set; (3) fuse the MTF encoded images of the two signals using IFCNN; (4) input the training set into the optimized ResNet model built for training, and save the optimal parameters; and (5) test the test samples and output the results to complete the intelligent fault diagnosis.

Experimental Design
An NU216 cylindrical roller bearing was selected as the experimental bearing. Defects were artificially introduced to the inner and outer rings, as well as the rolling elements, using a YLP-MDF-152 laser marking machine from Han's Laser. In actual working environments, alternating loads can cause cracks to form at a certain depth below the surface, which may then propagate to the surface and cause spalling. Fatigue spalling increases vibration and noise during rotation and is usually the main form of rolling bearing failure. Therefore, pitting was produced on the surface of the bearing at different locations to simulate early defects. The pitting diameter was set to 40 µm, and the pitting depth was controlled by setting the laser energy to 30%. Eight classes, as described in Section 2, were designed using the fault position as the classification criterion.
In order to simulate the working conditions of metro traction motors, three additional speeds and three additional loads were included in the experimental design. In consideration of both actual working conditions and minimizing the impact of bearing degradation on the experiment, gradient speeds of 800 rpm (low), 1600 rpm (medium) and 2400 rpm (high) were chosen, along with gradient equivalent dynamic loads of 5 kN (light), 7 kN (medium) and 9 kN (heavy) as the radial loads. There are a total of 72 (8 × 3 × 3) subexperiments. The experimental arrangement is shown in Table 4.
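The 72 subexperiments are simply the full combination of fault types, speeds and loads, which can be enumerated as in the sketch below. The integer fault labels 0 to 7 are an assumption standing in for the labels of Table 1, and the dictionary keys are illustrative names.

```python
from itertools import product

FAULT_LABELS = tuple(range(8))     # assumed numbering of the eight classes in Table 1
SPEEDS_RPM = (800, 1600, 2400)     # low, medium, high
LOADS_KN = (5, 7, 9)               # light, medium, heavy radial load


def build_subexperiments():
    """Every (fault type, speed, load) combination of the experimental design."""
    return [
        {"fault_label": fault, "speed_rpm": speed, "load_kn": load}
        for fault, speed, load in product(FAULT_LABELS, SPEEDS_RPM, LOADS_KN)
    ]


subexperiments = build_subexperiments()
print(len(subexperiments))
```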

Construction of the Signal Acquisition System
This study utilized the intelligent testing platform for comprehensive bearing performance, jointly developed by Henan University of Science and Technology, Luoyang Bearing Research Institute, and Intelligent Numerical Control Equipment Henan Provincial Engineering Laboratory, as the signal acquisition system. The testing machine allows for a maximum inner diameter of 120 mm, a maximum speed of 5000 r/min, a maximum radial load of 300 kN, and a maximum axial load of 200 kN for the bearing. The platform is equipped with a PCI-8 acoustic emission transmitter, two R50S-TC acoustic emission sensors, two LC0151T acceleration sensors, two LC0201-5 signal conditioners, and a PCI8510 data acquisition card.
During the experiment, a healthy bearing and a faulty bearing were installed at the two ends of the testing machine's spindle, and vibration and acoustic emission signals were collected from both bearings simultaneously. The loading system applies radial loads to the spindle via a pair of NU2218 cylindrical roller bearings, and these loads are in turn transferred to the test bearings at both ends of the spindle. The sensor signals are amplified and conditioned by signal amplifiers and signal conditioners and are then input to the computer through a PCI acquisition card. The principle of the signal acquisition system is shown in Figure 7. The physical set-up of the system is shown in Figure 8.

Diagnostic Scheme Design
To further validate the effectiveness of the proposed method, three types of diagnostic schemes were designed for single working condition changes, compound working condition changes, and generic working conditions, considering two different factors (speed and load) that affect the test results.
For single working condition changes, the speed is first held constant, data from two of the loads are placed in the training set, and data from the remaining load are placed in the test set to verify the robustness of the model. The same procedure is followed when the load is held constant and the speed varies. The specific diagnostic scheme is shown in Table 5. For compound working condition changes, the training set must contain data with different speeds and different loads at the same time. For generic working conditions, data for all fault types under all conditions must be present in both the training and testing sets.
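The single-condition-change schemes described above amount to holding one factor fixed and leaving one level of the other factor out for testing. A minimal sketch, assuming one scheme per fixed value and held-out level, is shown below; it yields 18 schemes, but the ordering and numbering are assumptions and need not match the item numbers of Table 5.

```python
SPEEDS_RPM = (800, 1600, 2400)
LOADS_KN = (5, 7, 9)


def single_change_schemes():
    """Train on two levels of the varying factor and test on the third,
    with the other factor held constant. Conditions are (speed, load) pairs."""
    schemes = []
    for speed in SPEEDS_RPM:                      # constant speed, varying load
        for held_out in LOADS_KN:
            train = [(speed, load) for load in LOADS_KN if load != held_out]
            schemes.append({"fixed": "speed", "train": train, "test": [(speed, held_out)]})
    for load in LOADS_KN:                         # constant load, varying speed
        for held_out in SPEEDS_RPM:
            train = [(speed, load) for speed in SPEEDS_RPM if speed != held_out]
            schemes.append({"fixed": "load", "train": train, "test": [(held_out, load)]})
    return schemes


schemes = single_change_schemes()
print(len(schemes), schemes[0])
```

Because the test condition never appears in the training conditions, each scheme measures how well the model transfers to a load or speed it has not seen.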

Experimental Results and Comparison of Methods
During the operation of a metro system, variations in bearing speed and load are inevitable. Because the preceding steady-state tests cannot reflect such variations, the results of variable working condition tests must be analyzed to validate the effectiveness of the proposed method. For the compound working condition changes, an additional comparison between the fusion of acoustic emission and vibration signals and the use of a single signal was included to highlight the advantages of the proposed method. In the generic working condition tests, the feature extraction capabilities of four models, namely the proposed model, RepVGG, CBAM-CNN and ResNet, were compared to evaluate their performance.

Single Working Condition Changes
Following the diagnostic scheme described in Section 5.3, with either the speed or the load held constant, the training set was input into the model constructed in this paper, and fault diagnosis was performed on the test set. The diagnostic results are shown in Table 6. The table shows that when the speed is held constant and the load is varied, the fault diagnosis accuracy reaches nearly 100%. Conversely, when the load is held constant and the speed is varied, the accuracy decreases, indicating that rotational speed has a substantial influence on the diagnostic outcome. Further analysis reveals that the accuracy of items 12, 15 and 18 is significantly lower, whereas items 3, 6 and 9 achieve accuracy close to 100%, albeit slightly lower than the other items among the first nine. This discrepancy can be attributed to the fact that fault characteristics extracted under medium- to high-speed and medium- to heavy-load conditions are more discernible than those extracted under low-speed and light-load conditions.

Compound Working Condition Changes
Mixed data with different speeds and loads were included in the training set, which was used to train the proposed model; fault diagnosis was then performed on the testing set. Subsequently, the fusion of acoustic emission and vibration signals was compared with the use of a single signal. The diagnostic results are shown in Table 7. The table shows that the diagnostic results of items 4 to 6 surpass those of items 1 to 3. Notably, the training and testing sets for items 1 to 3 differ in rotation speed, whereas those for items 4 to 6 differ in load. The diagnostic accuracy of items 4 to 6 remains relatively stable, whereas item 3 exhibits significantly lower accuracy than items 1 and 2. The reason for this is consistent with the findings presented in Section 6.1 of this paper.
From the standpoint of signal acquisition, the fusion of acoustic emission and vibration signals yields higher diagnostic accuracy in fault diagnosis compared to utilizing a single signal. This finding provides further substantiation that the application of multisignal fusion technology can effectively enhance system stability and diagnostic accuracy. Furthermore, it is evident that employing a single vibration signal for diagnostics yields superior results in comparison to employing a single acoustic emission signal. This can be attributed to the fact that the acoustic emission acquisition system exhibits heightened sensitivity to environmental noise, primarily stemming from the operational testing equipment, which poses challenges in noise elimination.

Generic Working Conditions
To evaluate the performance of the proposed fault diagnosis model, all fault samples involving three different speeds and three different loads were included in both the training and testing sets. The sample ratio between the two sets was set to 9:1 to ensure the training set was large enough for the model to learn the fault data effectively while still reserving an adequate number of samples for testing. Subsequently, the model was applied to diagnose faults on the testing set. The diagnostic results are visualized with the confusion matrix presented in Figure 9, which gives a clear and intuitive view of the model's misclassifications and the types of errors. It can be seen that the overall diagnostic performance is good, and the accuracy rate for the fusion of acoustic emission and vibration signals is almost 100%. However, the diagnosis accuracy rate for label 6, which corresponds to the "outer ring + rolling element pitting" fault type, is relatively low. The model misclassified three test samples of this type as "rolling element pitting". Further analysis revealed that the two types of faults have similar features, making it difficult to extract differences between them. Comparing panels (a-c) in Figure 9 further confirms that multisignal fusion technology has higher reliability and accuracy than a single signal, especially under changing working conditions.
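As an illustration of how a confusion matrix and the per-class accuracy read from it can be computed, the following is a minimal sketch in plain Python. The three-class toy labels are hypothetical and are not the paper's data; rows are assumed to index the true class and columns the predicted class.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly (row-wise)."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


# Hypothetical toy labels: one sample of class 2 is misclassified as class 1.
y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 1, 2, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
recalls = per_class_recall(cm)
```

Off-diagonal entries such as `cm[2][1]` show which class a misclassified sample was assigned to, which is how confusions between similar fault types become visible.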
To compare the feature extraction capabilities of different models, the training and testing samples of the above-mentioned generic working conditions were also input into the RepVGG, CBAM-CNN and ResNet models for diagnosis. Two types of faults were selected as examples: label 1 (corresponding to "inner ring pitting") with better diagnostic results and label 6 (corresponding to "outer ring + rolling element pitting") with poorer results. The precision-recall (PR) curves and receiver operating characteristic (ROC) curves were generated for the optimized ResNet, RepVGG, CBAM-CNN and ResNet models, and evaluation indicators such as average precision (AP) and area under the curve (AUC) were introduced.
The precision-recall (PR) curve is a graphical representation of the performance of a binary classification model, with recall on the x-axis and precision on the y-axis. It illustrates the trade-off between precision and recall at various classification thresholds. The relevant formulas for the PR curve are as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP represents the number of true positive instances; FP represents the number of false positive instances; and FN represents the number of false negative instances. Average precision (AP) summarizes the PR curve in a single value, obtained by computing the area under the PR curve. It provides a comprehensive assessment of how well the model balances precision and recall across different recall levels.
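The following minimal sketch shows one common way to trace a PR curve and approximate AP from classifier scores. It assumes binary labels (1 for the positive class), distinct scores, and a step-wise area approximation; the paper does not state which approximation it used, and the toy labels and scores are hypothetical. For a multi-class problem such as this one, each fault type would presumably be treated one-vs-rest.

```python
def pr_points(y_true, scores):
    """Return (recall, precision) pairs as the threshold is lowered past each sample."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    tp = fp = 0
    points = []
    for i in order:
        if y_true[i] == 1:
            tp += 1  # sample is a true positive at this threshold
        else:
            fp += 1  # sample is a false positive at this threshold
        recall = tp / total_pos          # TP / (TP + FN)
        precision = tp / (tp + fp)       # TP / (TP + FP)
        points.append((recall, precision))
    return points


def average_precision(points):
    """Step-wise area under the PR curve: sum of recall increments times precision."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical toy data: 3 positives, 2 negatives.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

points = pr_points(y_true, scores)
ap = average_precision(points)
```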
The receiver operating characteristic (ROC) curve is a tool used to evaluate the performance of binary classification models. It plots the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis. The ROC curve is defined by the following formulas:
$$\mathrm{FPR} = \frac{FP}{FP + TN}$$
$$\mathrm{TPR} = \frac{TP}{TP + FN}$$
where FP represents the number of negative instances incorrectly classified as positive; TN represents the number of negative instances correctly classified as negative; TP represents the number of positive instances correctly classified as positive; and FN represents the number of positive instances incorrectly classified as negative. Area under the curve (AUC) is obtained by calculating the area under the ROC curve. The resulting AUC value ranges from 0 to 1, where 0.5 represents a random classifier and 1 represents a perfect classifier. A higher AUC value indicates better classifier performance.
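A minimal sketch of the ROC construction and a trapezoidal AUC estimate is given below. It assumes binary labels, distinct scores, and the trapezoidal rule for the area; the toy labels and scores are hypothetical, and the paper does not specify its integration method.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, starting at (0, 0), as the threshold is lowered."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    total_neg = len(y_true) - total_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # FPR = FP / (FP + TN), TPR = TP / (TP + FN)
        points.append((fp / total_neg, tp / total_pos))
    return points


def auc_trapezoid(points):
    """Area under the curve through consecutive points, by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy data: 3 positives, 2 negatives.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]

roc = roc_points(y_true, scores)
auc = auc_trapezoid(roc)
```

For this toy data the estimate equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is the usual probabilistic reading of AUC.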
The diagnostic results are presented in the form of PR and ROC curves in Figures 10 and 11. The overall accuracy rate, AP and AUC for all fault types were calculated for the four models, and the weighted average values were recorded in Table 8.
Generally, the closer the PR curve in Figure 10 is to the upper right corner, the larger the AP value, and the better the model performance. The closer the ROC curve in Figure 11 is to the upper left corner, the larger the AUC value, and the better the model performance. For the two selected fault types with different diagnostic effects, the PR and ROC curves of the proposed model both lie closer to the respective corner than those of RepVGG, CBAM-CNN and ResNet, indicating better performance. Combined with the data in Table 8, the three evaluation indicators of the proposed model are higher than those of the compared models, validating the good feature extraction ability of the proposed model.
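The weighting used for the averages in Table 8 is not specified in this section. As a sketch under the assumption that each fault type is weighted by its number of test samples (its support), a weighted average of per-class metric values could be computed as follows; the per-class values and supports are hypothetical and are not taken from the paper.

```python
def weighted_average(values, supports):
    """Average per-class metric values, weighting each class by its sample count."""
    total = sum(supports)
    return sum(v * n for v, n in zip(values, supports)) / total


# Hypothetical per-class AP values and test-sample counts (not from the paper).
per_class_ap = [0.99, 0.95, 0.90]
supports = [50, 30, 20]

weighted_ap = weighted_average(per_class_ap, supports)
```

With equal supports this reduces to the plain (macro) average, so the choice of weighting matters only when the fault types are unevenly represented in the test set.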

Conclusions
This paper focused on the study of the feature extraction ability of the model for complex working conditions, using the metro traction motor bearings as the research object. On the basis of ResNet, CBAM was introduced to optimize the ResNet model. Nine different working conditions and eight compound fault types were designed for experimentation. In addition, a dataset was constructed using MTF image encoding and IFCNN image fusion technology. During the model training process, UMAP was used for visualization to intuitively demonstrate the feature extraction effect of the proposed model. After the experiment, three evaluation indicators were used for objective evaluation of the feature extraction ability of the optimized ResNet, RepVGG, CBAM-CNN and ResNet models.
The results of the experiment show that the MTF-ResNet model with multisignal fusion performs well under complex working conditions, with a diagnostic accuracy rate of up to 99.25%. Based on the results, some important conclusions can be drawn. In terms of sensors, using only vibration signals produces better diagnostic results than using only acoustic emission signals. In addition, compared with a single signal, fusing acoustic emission and vibration signals provides more comprehensive and integrated information while reducing misclassifications caused by the limitations of a single signal, thereby improving fault diagnosis accuracy and making the diagnosis result more reliable. In terms of data processing, MTF image encoding is a simple data processing method that retains the temporal correlation of the data, making it easier for the model to extract more comprehensive fault features. For feature extraction models, introducing CBAM after the batch normalization layers of the ResNet model helps the model focus on important features, distinguish different types of fault features quickly, and improve diagnostic efficiency. Furthermore, the ResNet structure can effectively alleviate the vanishing gradient problem that occurs as the network deepens, thereby preventing model degradation.
Undoubtedly, this study presents several avenues for future research in the proposed methodologies. Firstly, the inclusion of additional sensors or exploration of different sensor types holds promise. For instance, incorporating multidirectional vibration sensors or temperature sensors could offer a more comprehensive spectrum of fault information, thereby enhancing diagnostic fault tolerance. Secondly, exploring more advanced data processing techniques warrants investigation to enhance the quality of input signals. The acoustic emission signals acquired in this study exhibited significant levels of environmental noise that proved challenging to eliminate. Therefore, employing sophisticated techniques may substantially improve the value derived from these acoustic emission signals. Moreover, conducting model testing on larger datasets utilizing more complex compound faults can effectively confirm the feature extraction capabilities and generalization of the model. This approach will serve as a more robust means of validation. Furthermore, future research focusing on feature extraction models should prioritize the development of lightweight and efficient models to facilitate practical implementation.
Despite the inherent limitations of the methods proposed in this paper, they exhibit commendable feature extraction capabilities within intricate operational scenarios. Consequently, these methods hold potential for application in fault diagnosis tasks related to metro traction motor bearings, thereby possessing appreciable value in engineering applications.