1. Introduction
Rolling element bearings are critical transmission components in modern electromechanical systems, and their health condition directly affects equipment reliability, operational safety, and maintenance cost [1,2]. However, because of prolonged operation, variable loads, and harsh industrial environments, bearings are highly susceptible to faults such as wear, fatigue, and spalling. Once such faults develop, they may cause performance degradation, machine downtime, and even serious safety accidents. Therefore, timely and accurate bearing fault diagnosis remains an important topic in condition monitoring and intelligent maintenance.
Traditional rolling bearing fault diagnosis methods mainly rely on signal analysis in the time domain, frequency domain, and time–frequency domain [3,4,5,6]. Time-domain methods usually extract statistical indicators, such as root mean square, kurtosis, skewness, crest factor, and impulse factor, to characterize fault-related impulsiveness and amplitude variation [3]. Frequency-domain approaches focus on characteristic fault frequencies and their harmonics through spectral analysis, envelope demodulation, or spectral kurtosis [4]. Time–frequency methods, such as short-time Fourier transform, wavelet transform, empirical mode decomposition, and variational mode decomposition, are particularly suitable for nonlinear and non-stationary bearing signals [5,6]. Although these methods have achieved useful results, they often depend on handcrafted features and expert knowledge, and their performance may degrade under strong noise, sensor variability, and complex operating conditions.
Among the available sensing modalities, vibration signals are the most widely used in bearing diagnosis because they directly reflect the dynamic responses of mechanical structures [7,8,9,10,11,12]. For example, Wang et al. [7] combined improved multiscale weighted permutation entropy with an optimized support vector machine to improve diagnosis robustness under complex conditions. Ni et al. [9] proposed a fault-information-guided variational mode decomposition method to extract intrinsic mode functions associated with bearing defects more effectively. In the deep-learning domain, Zhao et al. [11] developed a variable-speed diagnosis framework based on a deep branched attention network and multiscale entropy, while Li et al. [12] introduced a semi-supervised graph convolutional network using images transformed from vibration signals. These studies demonstrate the effectiveness of vibration-based diagnosis, but vibration measurement usually requires contact sensors and may be affected by installation position and environmental interference.
In contrast, acoustic signals provide several attractive advantages, including non-contact measurement, flexible deployment, and sensitivity to certain early-stage fault signatures [13,14,15,16,17,18]. Shiri et al. [15] used acoustic sensing in a robotic platform to monitor conveyor belt rollers and improved the safety and effectiveness of fault detection. Hou et al. [16] investigated acoustic-emission-based diagnosis for high-speed train wheelset bearings and established a mathematical model considering elastic distortion and signal attenuation. Sun et al. [17] employed acoustic diagnosis with multiscale fractional permutation entropy in railway turnout machines, and Tang et al. [18] achieved high-accuracy gearbox fault diagnosis through joint use of thermal imaging and acoustic information. Nevertheless, acoustic signals are more vulnerable to environmental noise and often have a lower signal-to-noise ratio than vibration signals. For this reason, relying on a single sensing modality may not be sufficient to fully describe complex bearing health states.
To overcome the limitations of single-sensor diagnosis, multimodal fusion of acoustic and vibration signals has attracted increasing attention [19,20,21,22,23,24,25,26,27,28,29,30]. Because these two modalities are complementary in physical meaning, frequency sensitivity, and acquisition characteristics, their fusion can provide a more comprehensive representation of bearing fault information [19,20]. Yan et al. [21] proposed a fusion method based on a secondary convolutional neural network and reported improved diagnosis accuracy. Tang et al. [22] showed that the combination of vibration and acoustic-emission signals can more comprehensively reflect bearing operating conditions. Buchaiah et al. [23] introduced a feature extraction and selection framework for multimodal data fusion to improve predictive accuracy and stability. Saufi et al. [24] fused raw signals from different sensors through a DE-1D-CNN. Guo et al. [25] designed a fault diagnosis method combining feature extraction and a bag-of-words model. Pacheco-Cherrez et al. [26] provided a versatile approach to enhance predictive maintenance methodology in rotating machinery using vibration and acoustic signals. Chu et al. [27] and Liu et al. [28] further extended multimodal diagnosis toward cross-domain and domain-adaptive settings. Li et al. [29] and Li et al. [30] also improved classification accuracy by effectively integrating acoustic and vibration signals. Despite this progress, many existing fusion methods still rely on direct feature concatenation, manually designed fusion strategies, or modality-agnostic integration, which may not fully exploit modality complementarity at the representation-learning level.
In parallel, graph neural networks (GNNs) have shown strong capability in modeling non-Euclidean relationships and high-order dependencies between samples [31,32,33,34,35,36]. By representing each sample as a node and its similarity relations as edges, GNNs can capture structural information that is difficult to exploit using conventional sample-independent networks. Liu et al. [32] introduced a heterogeneous GNN model to capture complex relational patterns, while Yong et al. [33] proposed a bearing diagnosis method combining dual-path CNNs and multiple graph convolutional networks. Liu et al. [34] developed a dynamic temporal GNN for multivariate time-series classification, and Kavianpour et al. [35] and Ghorvei et al. [36] further demonstrated the value of graph-based representation learning under domain adaptation and varying working conditions. These studies indicate that graph learning can enhance diagnosis performance by exploiting inter-sample structural relations. However, in multimodal bearing diagnosis, two challenges remain insufficiently addressed: first, how to adaptively weight different sensing modalities before fusion; and second, how to preserve modality-discriminative characteristics while performing graph-based relational learning. Ait Ichou et al. [37] proposed a deep multimodal learning framework for heart sound classification by integrating CNNs, Transformers, and BiLSTM with attention, showing that multimodal feature fusion and hybrid sequence modeling can significantly improve classification robustness in noisy and non-stationary acoustic signals.
To better position this study with respect to prior acoustic–vibration fusion and graph-based diagnosis methods, the novelty of the proposed MTMAGNet does not lie in the isolated use of CNNs, attention, GCNs, or multi-task learning, since each of these components has already been established in previous studies. Instead, the key novelty lies in how these components are organized into a unified multimodal relational-learning framework. Specifically, unlike existing acoustic–vibration fusion approaches that mainly rely on direct concatenation or decision-level fusion, the proposed method introduces a modality-level attention mechanism to adaptively reweight acoustic and vibration embeddings before fusion. Unlike conventional graph-based diagnosis methods that usually construct graphs from single-modal features, MTMAGNet performs graph learning on the fused multimodal representation, so that inter-sample relations are established after cross-modal interaction. In addition, an auxiliary modality-classification task is introduced to preserve modality-discriminative characteristics during feature learning, thereby improving the quality of the fused representation and enhancing generalization.
Accordingly, this paper proposes a Multi-Task Multimodal Attention Graph Convolutional Network (MTMAGNet) for acoustic–vibration fusion-based rolling bearing fault diagnosis. The main contributions of this work are summarized as follows:
- (1) A modality-attention-based acoustic–vibration fusion strategy is proposed to adaptively exploit the complementary information of the two sensing modalities;
- (2) A k-nearest-neighbor graph is constructed on the fused multimodal embeddings so that inter-sample structural relations can be learned by a graph convolutional network;
- (3) A multi-task learning scheme is introduced, in which fault classification is treated as the main task and modality classification is used as an auxiliary task to regularize representation learning and improve generalization.
The remainder of this paper is organized as follows:
Section 2 introduces related work, covering theories on multimodal fusion, graph neural networks, and multi-task learning, and details the proposed method, including its structure, algorithm implementation, and technical specifics.
Section 3 presents the experimental details.
Section 4 provides experimental validation and result analysis, demonstrating the effectiveness and superiority of the proposed method compared to existing approaches.
Section 5 summarizes the main contributions and discusses the method’s limitations and potential future research directions.
2. Proposed Method
This section provides a detailed description of the MTMAGNet method, which serves as the core component of the proposed multi-modality diagnostic framework. Following this, the internal mechanism of the modality attention is explained, with a focus on how adaptive weights are assigned to acoustic and vibration signals. The specific implementation of the MTMAGNet model, including graph convolutional layers, feature fusion strategy, and optimization approach, is also presented.
2.1. Feature Extraction from Acoustic and Vibration Data
Assume that the acoustic and vibration signals are discrete time series with a sampling frequency of $f_s$. The acoustic and vibration signal values at each discrete time point $n$ are denoted as $X_a[n]$ and $X_v[n]$, respectively, where $n = 1, 2, \ldots, N$ and $N$ is the total number of sampling points. The discrete acoustic signal can be represented as:
\[ X_a = \{X_a[1], X_a[2], \ldots, X_a[N]\} \tag{1} \]
Similarly, the discrete vibration signal can be represented as:
\[ X_v = \{X_v[1], X_v[2], \ldots, X_v[N]\} \tag{2} \]
In these equations, $X_a[n]$ represents the acoustic signal value at the $n$-th synchronized time point, and $X_v[n]$ represents the vibration signal value at the same time point.
The core of feature extraction lies in transforming the input time-series signals $X_a$ and $X_v$ into corresponding low-dimensional feature vectors $h_a$ and $h_v$. In this paper, we use convolutional neural networks (CNNs) to extract local features while reducing dimensionality. The convolution operation is a key module for capturing local patterns from input signals. The convolution can be expressed mathematically as follows:
\[ h_t^{(l)} = \sigma\left( \sum_{i=0}^{k-1} W_i^{(l)} \, h_{t+i}^{(l-1)} + b^{(l)} \right) \tag{3} \]
where $h_t^{(l)}$ is the output of the $l$-th convolutional layer at time $t$ (with $h^{(0)}$ denoting the input signal), $W^{(l)}$ represents the convolution kernel weights of size $k$, where $k$ is the window size used to calculate local statistics over the temporal dimension, $b^{(l)}$ is the bias term, and $\sigma(\cdot)$ is the activation function. By using this one-dimensional convolution, we can capture local temporal relationships and dependencies from the time-domain signal.
Pooling layers are introduced to reduce the temporal dimensionality while preserving key features. A common pooling strategy is max pooling, whose mathematical formula is:
\[ p_t^{(l)} = \max_{0 \le i < s} h_{t s + i}^{(l)} \tag{4} \]
where $p_t^{(l)}$ represents the pooled features, and $s$ is the size of the pooling window. Pooling retains the most salient local features within each window, which improves the robustness of the extracted features.
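The convolution and pooling steps described above can be illustrated with a minimal plain-Python sketch. It is not the authors' implementation: it assumes a single input channel, no padding, a ReLU activation, and non-overlapping pooling windows, and the kernel and signal values are hypothetical.

```python
def conv1d_valid(x, w, b):
    """Single-channel 1-D convolution (cross-correlation form, no padding)."""
    k = len(w)
    return [sum(w[i] * x[t + i] for i in range(k)) + b
            for t in range(len(x) - k + 1)]


def relu(h):
    """Element-wise ReLU, used here as the activation (an assumption)."""
    return [max(0.0, v) for v in h]


def max_pool1d(h, s):
    """Max pooling with window size s; stride is assumed equal to s."""
    return [max(h[t * s:(t + 1) * s]) for t in range(len(h) // s)]


# Hypothetical toy signal containing one strong impulse at index 5.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 5.0, 0.0, -1.0, 0.0, 1.0]
kernel = [1.0, 0.0, -1.0]  # hypothetical difference kernel, k = 3

features = max_pool1d(relu(conv1d_valid(signal, kernel, b=0.0)), s=2)
print(features)  # [2.0, 0.0, 6.0, 0.0]
```

The largest pooled value comes from the window containing the impulse, which is the sense in which pooling keeps the most salient local response while halving the temporal length.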
Specifically, acoustic and vibration signals are processed through multiple layers of one-dimensional convolutions to extract local patterns, followed by down-sampling through pooling layers. The final output is then projected into low-dimensional feature spaces through fully connected layers. The entire feature extraction process can be mathematically described as:
\[ h_a = \mathrm{FC}\left(\mathrm{Pool}\left(\sigma\left(\mathrm{Conv}(X_a)\right)\right)\right) \tag{5} \]
\[ h_v = \mathrm{FC}\left(\mathrm{Pool}\left(\sigma\left(\mathrm{Conv}(X_v)\right)\right)\right) \tag{6} \]
where Conv(·) denotes the convolution operation, σ(·) is the activation function, Pool(·) represents the pooling operation, and FC(·) denotes the fully connected projection. The output is a d-dimensional feature vector.
Through convolution operations, the model can capture local signal patterns, such as peaks and periodic characteristics, which helps the model to reduce the impact of noise. Pooling layers help reduce the data dimensionality, avoiding overfitting while preserving key information. By passing through multiple convolutional and pooling layers, the raw time-series signals are compressed into low-dimensional feature vectors, which can be more effectively fused in the subsequent multi-modality integration and classification tasks. Although the acoustic and vibration branches adopt the same feature-extraction architecture, their parameters are not shared. This design is necessary because the acoustic branch takes a 16-channel input, whereas the vibration branch takes a single-channel input. Therefore, the two branches have the same layer configuration but are independently parameterized to better adapt to the different statistical characteristics of the two sensing modalities.
2.2. Modality Attention Mechanism
The modality attention mechanism calculates attention scores $s_a$ and $s_v$ for the acoustic and vibration modalities, respectively, to reflect the relative importance of each modality. The attention scores are computed using a nonlinear activation function, tanh, to capture complex inter-modal relationships. The attention score for the acoustic modality is defined as follows:
\[ s_a = v^{\top} \tanh\left(W_a h_a + b_a\right) \tag{7} \]
where $h_a$ represents the feature vector extracted from the acoustic signal, which is the output from the feature extractor; $W_a$ is a learnable weight matrix that maps acoustic features to a hidden space, with a dimension of d; $b_a$ is the bias term; and $v$ is the learnable attention vector, which transforms the activated features into a scalar attention score.
Similarly, the attention score for the vibration modality is defined as:
\[ s_v = v^{\top} \tanh\left(W_v h_v + b_v\right) \tag{8} \]
In a similar fashion, $h_v$ represents the feature vector extracted from the vibration signal, $W_v$ and $b_v$ are the corresponding weight matrix and bias term, and $v$ is the attention vector shared with the acoustic modality. After obtaining the attention scores for both modalities, softmax normalization is applied to compute the attention weights $\alpha_a$ and $\alpha_v$, representing the relative importance of each modality during the fusion process. The attention weights for acoustic and vibration modalities are defined as follows:
\[ \alpha_a = \frac{\exp(s_a)}{\exp(s_a) + \exp(s_v)} \tag{9} \]
\[ \alpha_v = \frac{\exp(s_v)}{\exp(s_a) + \exp(s_v)} \tag{10} \]
From Equations (9) and (10), it can be observed that the sum of the attention weights for both modalities equals 1, ensuring that the modalities are fused in a weighted manner.
After computing the attention weights for the acoustic and vibration modalities, the final fused feature vector is obtained by combining the two modality features in a weighted sum:
\[ h_{\mathrm{fused}} = \alpha_a h_a + \alpha_v h_v \tag{11} \]
where $h_{\mathrm{fused}}$ is the fused feature vector, and $\alpha_a$ and $\alpha_v$ represent the attention weights for each modality, which are applied to the acoustic and vibration feature vectors, respectively. The fused feature vector $h_{\mathrm{fused}}$ combines information from both modalities, reflecting the varying importance of each modality’s features, thus providing a rich representation for subsequent classification tasks.
In this work, a lightweight additive attention formulation is adopted for modality weighting. Compared with dot-product attention, this design is more suitable for the present two-modality fusion scenario because the acoustic and vibration features are heterogeneous in sensing characteristics and may not be directly comparable through inner-product similarity alone. The nonlinear projection with tanh allows each modality to be mapped into a shared latent space before scoring, while the learnable vector provides a simple and stable mechanism for estimating modality importance with low computational overhead. Since the purpose here is to assign a global fusion weight to each modality rather than to model complex token-wise interactions, the adopted formulation is sufficient and computationally efficient.
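The additive modality attention described in this subsection can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' code: the function names, the toy two-dimensional features, and the parameter values are hypothetical, and the score is assumed to take the form "attention vector times tanh of a linear projection", with one projection per modality and a shared attention vector.

```python
import math


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def attention_score(h, W, b, v):
    """Scalar score: v^T tanh(W h + b)."""
    hidden = [math.tanh(z + b_i) for z, b_i in zip(matvec(W, h), b)]
    return sum(v_i * u_i for v_i, u_i in zip(v, hidden))


def fuse(h_a, h_v, W_a, b_a, W_v, b_v, v):
    """Softmax over the two modality scores, then a weighted sum of features."""
    s_a = attention_score(h_a, W_a, b_a, v)
    s_v = attention_score(h_v, W_v, b_v, v)
    m = max(s_a, s_v)  # subtract the max for a numerically stable softmax
    e_a, e_v = math.exp(s_a - m), math.exp(s_v - m)
    alpha_a, alpha_v = e_a / (e_a + e_v), e_v / (e_a + e_v)
    h_fused = [alpha_a * x + alpha_v * y for x, y in zip(h_a, h_v)]
    return alpha_a, alpha_v, h_fused


# Hypothetical toy parameters: identity projections and a shared vector v.
I2 = [[1.0, 0.0], [0.0, 1.0]]
alpha_a, alpha_v, h_fused = fuse(
    h_a=[1.0, 0.0], h_v=[0.0, 1.0],
    W_a=I2, b_a=[0.0, 0.0], W_v=I2, b_v=[0.0, 0.0], v=[1.0, 0.0])
print(alpha_a, alpha_v, h_fused)
```

With these toy values the acoustic score is larger than the vibration score, so the acoustic feature receives the larger weight, and the two weights always sum to one.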
2.3. Graph Construction
To capture structural relationships among samples, a k-nearest-neighbor (kNN) graph is constructed in the fused multimodal feature space, where edges connect samples with high representational similarity. This graph-based structure enables the subsequent GCN to exploit inter-sample relational information beyond individual feature extraction. To ensure the graph remains consistent with evolving representations during training, the kNN graph is dynamically reconstructed at each epoch based on the latest feature embeddings, rather than being fixed at initialization. This dynamic update strategy prevents the adjacency structure from becoming misaligned with the learned feature space, thereby improving the effectiveness of graph-based feature propagation.
The feature matrix $H$, composed of the fused feature vectors $h_1, h_2, \ldots, h_N$ of all samples, is represented as follows:
\[ H = [h_1, h_2, \ldots, h_N]^{\top} \in \mathbb{R}^{N \times d} \tag{12} \]
where N denotes the total number of samples, and d represents the feature dimension for each sample. To construct the kNN graph, the Euclidean distance is used to measure the similarity between samples, thereby determining the connections between nodes. For each node, the k nearest neighbors are selected, and an adjacency matrix is constructed based on the similarity between samples. Each edge in the adjacency matrix represents the similarity between the connected samples.
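The graph construction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: edges are binary (0/1) rather than similarity-weighted, the adjacency is symmetrized so that an edge exists if either endpoint selects the other, and the function name and toy features are hypothetical.

```python
import math


def knn_adjacency(H, k):
    """Binary, symmetric adjacency matrix of a kNN graph (Euclidean distance)."""
    n = len(H)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        # Distances from node i to every other node, nearest first.
        dists = sorted((math.dist(H[i], H[j]), j) for j in range(n) if j != i)
        for _, j in dists[:k]:
            A[i][j] = A[j][i] = 1  # symmetrization is an assumption
    return A


# Hypothetical fused features forming two well-separated pairs.
H = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
A = knn_adjacency(H, k=1)
for row in A:
    print(row)
```

With k = 1 each sample links only to the other member of its pair, so no edge crosses between the two groups. A larger k would begin to add such cross-group edges.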
After graph construction, the fused multimodal features are processed by a graph convolutional network (GCN). The graph propagation rule adopted in this work follows the classical symmetrically normalized GCN formulation proposed by Kipf and Welling [38].
For each node $i$, a linear transformation is applied to project the original feature vector into a new feature space:
\[ h_i' = W h_i \tag{13} \]
where $h_i$ is the original feature vector of node $i$, $W$ is a learnable weight matrix, and $h_i'$ is the transformed feature vector. For node $i$, the aggregated feature representation $\tilde{h}_i$ is defined as:
\[ \tilde{h}_i = \sigma\left( \sum_{j \in \mathcal{N}(i)} \frac{1}{\sqrt{\deg(i)\,\deg(j)}} \, W h_j \right) \tag{14} \]
where $\mathcal{N}(i)$ denotes the set of neighboring nodes of node $i$, $\deg(i)$ and $\deg(j)$ represent the degrees of nodes $i$ and $j$, respectively, $W$ is the weight matrix shared across the layer, and $\sigma(\cdot)$ is a nonlinear activation function.
The normalized adjacency matrix is defined as follows:
\[ \hat{A} = D^{-1/2} A D^{-1/2} \tag{15} \]
where $A$ is the adjacency matrix of the graph, $D$ is the degree matrix, and $D_{ii} = \sum_{j} A_{ij}$.
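The symmetric normalization and one propagation step can be sketched in plain Python as below. This is a hedged illustration, not the authors' code: self-loops are omitted to match the text as written (the original Kipf and Welling formulation adds them before normalization), ReLU is assumed as the activation, features are stored as rows so the transform is written as H W, and the toy graph and weights are hypothetical.

```python
import math


def normalize_adjacency(A):
    """Return D^{-1/2} A D^{-1/2}, where D_ii is the row sum of A."""
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def gcn_layer(A_hat, H, W):
    """One propagation step: ReLU(A_hat @ H @ W), with one row per node."""
    d_in, d_out = len(W), len(W[0])
    HW = [[sum(h[p] * W[p][q] for p in range(d_in)) for q in range(d_out)]
          for h in H]
    n = len(H)
    return [[max(0.0, sum(A_hat[i][j] * HW[j][q] for j in range(n)))
             for q in range(d_out)] for i in range(n)]


# Hypothetical 3-node path graph 0 - 1 - 2 with scalar node features.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A_hat = normalize_adjacency(A)
out = gcn_layer(A_hat, H=[[1.0], [2.0], [3.0]], W=[[1.0]])
print(A_hat[0][1], out)
```

Each edge weight is divided by the square root of the product of the endpoint degrees, so contributions from high-degree nodes are scaled down during aggregation.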
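After aggregating information from neighboring nodes, the output layer of the GCN uses a classifier to predict the class of each node. The classification expression is as follows:
\[ \hat{y}_i = \mathrm{softmax}\left(W_c \tilde{h}_i + b_c\right) \tag{16} \]
where $\tilde{h}_i$ is the aggregated representation of node $i$, $W_c$ is the weight matrix of the classifier, $b_c$ is the bias term of the classifier, $\hat{y}_i \in \mathbb{R}^{C}$ is the predicted class-probability vector, and C is the number of classes. The construction and iteration process of the GCN is shown in Figure 1.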
2.4. Multi-Task Learning
In the multi-task learning framework, the model is required to simultaneously learn both the primary and auxiliary tasks. The primary task is fault classification, which predicts the class label for each sample. The auxiliary task is modality classification, which is applied to the modality-specific feature vectors $h_a$ and $h_v$ before fusion. The purpose of this auxiliary task is to preserve modality-discriminative information during representation learning, thereby improving the quality of multimodal fusion and enhancing generalization. The modality classifier is formulated as:
\[ \hat{y}_{\mathrm{modality}} = \mathrm{softmax}\left(W_{\mathrm{mod}} h + b_{\mathrm{mod}}\right), \quad h \in \{h_a, h_v\} \tag{17} \]
In this equation, $W_{\mathrm{mod}}$ is the weight matrix of the modality classifier, which projects a modality-specific feature vector onto a two-dimensional probability space representing the acoustic and vibration modalities, and $b_{\mathrm{mod}}$ is the bias term of the modality classifier. To train the modality classifier, the true labels $y_{\mathrm{modality}}$ for each modality are defined with a one-hot encoding as follows:
\[ y_{\mathrm{modality}} = \begin{cases} [1, 0], & \text{acoustic features} \\ [0, 1], & \text{vibration features} \end{cases} \tag{18} \]
Through the modality classification task, the model can better distinguish features originating from different modalities, and this supervised learning helps to improve the quality of the fused features.
2.5. Loss Function Construction
In the multi-task learning process, the model’s loss function consists of the primary classification task loss and the auxiliary modality classification task loss. To enable the model to achieve optimal performance on both tasks, the final loss is a weighted sum of the primary and auxiliary task losses.
The loss function for the primary classification task employs cross-entropy loss, which measures the difference between the predicted and true class labels. The classification task loss is defined as:
\[ L_{\mathrm{class}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij} \tag{19} \]
where N is the number of samples, C is the number of classes, $y_{ij}$ represents the true label of sample $i$ for class $j$, and $\hat{y}_{ij}$ is the predicted probability of the model that sample $i$ belongs to class $j$. Cross-entropy loss measures the divergence between the true label distribution and the predicted probabilities. If the model assigns a high probability to the true class, the loss is small; otherwise, it is large.
The auxiliary modality classification task is also trained with a cross-entropy loss, defined as follows:
\[ L_{\mathrm{modality}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{2} y_{i,m}^{\mathrm{modality}} \log \hat{y}_{i,m}^{\mathrm{modality}} \tag{20} \]
where $y_{i}^{\mathrm{modality}}$ is the modality label for sample i, represented as a one-hot encoded vector. For acoustic features, the label is [1, 0]; for vibration features, the label is [0, 1]. $\hat{y}_{i}^{\mathrm{modality}}$ is the model’s predicted probability. The modality classification loss encourages the model to correctly identify which modality each feature vector originates from.
The total loss is a weighted sum of the classification and modality classification losses, and is expressed as:
\[ L_{\mathrm{total}} = \lambda_{\mathrm{class}} L_{\mathrm{class}} + \lambda_{\mathrm{modality}} L_{\mathrm{modality}} \tag{21} \]
where $\lambda_{\mathrm{class}}$ and $\lambda_{\mathrm{modality}}$ are weight parameters. These weights adjust the importance of the primary and auxiliary tasks in the total loss function. Through this approach, the model can jointly optimize the primary classification task and the auxiliary modality classification task, allowing the task weights to be adjusted for specific application scenarios. Accordingly, the auxiliary loss $L_{\mathrm{modality}}$ is used to regularize the shared representation learning process, while $L_{\mathrm{class}}$ remains the primary optimization target for fault diagnosis.
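The weighted multi-task objective can be illustrated with a small plain-Python sketch. It is only a sketch: labels are given as class indices, which is equivalent to one-hot labels for cross-entropy. The default weights 1.0 and 0.4 are the values reported in the hyper-parameter settings of this paper, while the probability values and function names are hypothetical.

```python
import math


def cross_entropy(probs, labels):
    """Mean cross-entropy; probs[i][j] is the predicted probability of class j
    for sample i, and labels[i] is the index of the true class."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def total_loss(l_class, l_modality, lam_class=1.0, lam_modality=0.4):
    """Weighted sum of the main and auxiliary losses."""
    return lam_class * l_class + lam_modality * l_modality


# Hypothetical predictions for two samples on a 3-class fault task.
fault_probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
fault_labels = [0, 1]

# Hypothetical modality predictions: 0 = acoustic, 1 = vibration.
modality_probs = [[0.5, 0.5], [0.5, 0.5]]
modality_labels = [0, 1]

l_class = cross_entropy(fault_probs, fault_labels)
l_modality = cross_entropy(modality_probs, modality_labels)
print(l_class, l_modality, total_loss(l_class, l_modality))
```

Because the auxiliary weight is smaller than the main-task weight, an error of equal size on the modality task contributes less to the gradient than one on the fault task.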
The Adam optimizer is ultimately used to optimize all parameters of the model. Algorithm 1 provides the implementation process for fault diagnosis in the MTMAGNet data fusion model.
| Algorithm 1: MTMAGNet Fusion Algorithm for Intelligent Fault Diagnosis |
| Require: Acoustic signal dataset $X_a$, vibration signal dataset $X_v$ |
| Ensure: Classification results of the fused signals |
| 1: for each data file i = 1 to N do |
| 2: Read the acoustic signal and the vibration signal |
| 3: Obtain the corresponding class label |
| 4: end for |
| 5: Perform normalization on $X_a$ and $X_v$ to obtain the normalized signals |
| 6: Define the FeatureExtractor $f_{FE}$ |
| 7: Define the ModalityAttention |
| 8: Define the GCN |
| 9: Define the modality classifier MultiTaskModel |
| 10: Initialize the model parameters and the optimizer |
| 11: for epoch from 1 to num_epochs do |
| 12: for each training batch do |
| 13: Define modality labels: acoustic modality label $y_a = 0$, vibration modality label $y_v = 1$ |
| 14: Input the normalized signals to the FeatureExtractor to obtain feature vectors $h_a$, $h_v$ |
| 15: Compute attention scores $s_a$, $s_v$ using ModalityAttention, then compute attention weights $\alpha_a$, $\alpha_v$ via softmax |
| 16: Fuse features to obtain the fused feature vector: $h_{\mathrm{fused}} = \alpha_a h_a + \alpha_v h_v$ |
| 17: Construct a kNN graph based on $h_{\mathrm{fused}}$ to obtain the adjacency matrix or edge_index |
| 18: Input $h_{\mathrm{fused}}$ and edge_index into the GCN to obtain classification logits |
| 19: Compute modality logits for the acoustic and vibration modalities using the modality classifier |
| 20: Compute the main task loss $L_{\mathrm{class}}$ and the auxiliary modality classification loss $L_{\mathrm{modality}}$ |
| 21: Compute the total loss: $L_{\mathrm{total}} = \lambda_{\mathrm{class}} L_{\mathrm{class}} + \lambda_{\mathrm{modality}} L_{\mathrm{modality}}$ |
| 22: Perform backpropagation and update model parameters using the optimizer |
| 23: end for |
| 24: Update the learning rate scheduler |
| 25: end for |
| 26: for each testing batch do |
| 27: Repeat steps 13 to 20 from the training phase without performing backpropagation |
| 28: end for |
2.6. MTMAGNet Intelligent Diagnosis Framework
The proposed MTMAGNet is an intelligent fault diagnosis framework that extracts signal features using convolutional networks, fuses these features via the modality attention mechanism (denoted MMA), models sample relationships with a GCN, and optimizes the classification tasks through multi-task learning (MTL).
Figure 2 provides an overview of the framework, which consists of the following four key steps:
Step 1: Acquire acoustic and vibration signals from rolling bearings of mechanical equipment operating under various conditions. These signals serve as the primary input for fault diagnosis, capturing crucial information about the equipment’s operational state.
Step 2: Use CNN to extract deep features from the acquired acoustic and vibration signals. Then, apply the MMA to fuse these features. Attention weights are computed to highlight the most relevant features from each modality, resulting in fused feature representations.
Step 3: Construct a graph representation of the data using the fused feature vectors, where nodes represent the samples and edges are formed based on their similarity. The graph structure is created using the KNN algorithm. The fused features and graph are then passed to the GCN, which captures both local and global relationships among the samples, yielding enhanced feature representations.
Step 4: The GCN outputs are fed into the multi-task learning (MTL) module, which includes both the main fault-classification task and the auxiliary modality-classification task. A total loss function is defined, and the model is trained by minimizing this loss, enabling accurate fault diagnosis based on the fused multi-modal signals.
3. Experiment
To verify the effectiveness of the proposed algorithm, we set up a rolling bearing acoustic and vibration test bench, as shown in Figure 3. The experimental system mainly comprises a motor, controller, shaft, test bearing, acoustic array sensors, and data acquisition system. The acoustic subsystem is a 16-channel microphone array composed of 16 BSWA MPA416 acoustic sensors, each with a sensitivity of 50 mV/Pa. During data acquisition, the 16 microphone channels are synchronously sampled and stored as multichannel acoustic recordings. For each sample, temporally aligned 1024-point segments are extracted simultaneously from all 16 channels and stacked to form a 16-channel acoustic input tensor.
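The segment extraction described above can be sketched in plain Python. This is an illustration only: it assumes non-overlapping windows (the text does not state whether segments overlap), and the function name and the synthetic recording are hypothetical.

```python
def segment_multichannel(recording, window=1024):
    """Cut a multichannel recording into temporally aligned segments.

    recording: list of channels, each a list of samples.
    Returns a list of segments, each shaped [channels][window].
    Windows are assumed non-overlapping, and a trailing partial window is dropped.
    """
    n = min(len(channel) for channel in recording)
    return [[channel[start:start + window] for channel in recording]
            for start in range(0, n - window + 1, window)]


# Hypothetical 16-channel recording with 4096 samples per channel.
recording = [[float(c)] * 4096 for c in range(16)]
segments = segment_multichannel(recording)
print(len(segments), len(segments[0]), len(segments[0][0]))  # 4 16 1024
```

Because every channel is sliced with the same start index, each segment keeps the 16 channels time-aligned, which is what allows them to be stacked into one input tensor.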
These sensors are arranged in a circular array mounted on a disk, forming an acoustic array board. To ensure signal quality and reduce signal attenuation, the acoustic array board is installed on a plane 200 mm away from the end of the bearing under test, with its center aligned with the bearing axis. During the operation of the bearing, the acoustic array sensors collect acoustic signals and transmit them to a computer terminal via a PAK MKII-SC42 data acquisition system, which converts the analog signals into digital signals. To meet the requirements of the sampling theorem, the sampling frequency of the acoustic signals is set to 16,384 Hz. A vibration sensor is installed on the bearing housing of the bearing under test to collect vibration signals. These signals are collected using a BSZ800D-16 data acquisition device, with the sampling frequency also set to 16,384 Hz. The vibration data acquisition device is connected to the computer terminal and is used alongside the acoustic signals for subsequent data analysis. This experiment investigates faults of the outer race, inner race, and rolling elements of rolling bearings under a fixed rotational speed of 2400 rpm. The geometric parameters of the bearing are shown in Table 1.
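As a quick consistency check on the acquisition settings stated in this section (16,384 Hz sampling, 2400 rpm shaft speed, and 1024-point input segments), the following arithmetic relates segment length to shaft rotation. It uses only values given in the text; the symbols $T_{\mathrm{seg}}$, $f_r$, $n_{\mathrm{rev}}$, and $f_{\mathrm{Nyquist}}$ are introduced here purely for illustration.

```latex
\begin{align*}
T_{\mathrm{seg}} &= \frac{1024}{16384\ \mathrm{Hz}} = 62.5\ \mathrm{ms}, \\
f_r &= \frac{2400\ \mathrm{rpm}}{60\ \mathrm{s/min}} = 40\ \mathrm{Hz}, \\
n_{\mathrm{rev}} &= f_r \, T_{\mathrm{seg}} = 40 \times 0.0625 = 2.5, \\
f_{\mathrm{Nyquist}} &= \frac{16384\ \mathrm{Hz}}{2} = 8192\ \mathrm{Hz}.
\end{align*}
```

Each input segment therefore spans about two and a half shaft revolutions, and spectral content up to 8192 Hz can be represented without aliasing.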
Figure 4 presents a set of time-domain signals of vibration and acoustical data collected from the test bench for seven different fault types. Each subfigure separately shows the amplitude variations in vibration and acoustical recordings over time under specific fault conditions. In the left column of each figure, the vibration signals exhibit a wide range of amplitude fluctuations across various fault conditions, indicating potential differences in energy distribution and fault impact intensity. The right column displays the acoustical signals, whose amplitude variations are consistent with the fault characteristics depicted by the vibration signals. These signals provide foundational insights into the characteristics of each fault type, which will be utilized in the feature extraction and classification stages of the proposed multi-modal diagnostic framework.
To construct the bearing fault dataset, experiments were conducted at a fixed working speed of 2400 rpm. Considering different fault types, we constructed seven different operating conditions for the rolling bearing, including two normal operating conditions and six fault conditions such as outer race faults, inner race faults, rolling element faults, and some coupled faults. Each condition was assigned a label from C0 to C6 to represent different bearing states. For both acoustic and vibration signals, 400 samples were collected for each state, totaling 2800 samples per type of signal, with 20% used for validation. Detailed information is listed in Table 2.
The proposed model is implemented using Python 3.10 and PyTorch 1.11.0 (GPU build), and runs on an NVIDIA GeForce RTX 3070 GPU paired with an 11th Gen Intel(R) Core(TM) i7-11700K processor at 3.60 GHz and 48 GB of RAM. The network architecture and parameter configuration are detailed in Table 3, which describes the structural components of each layer in the model. The FeatureExtractor module consists of three 1D convolutional layers, with kernel sizes of 5, 5, and 3, respectively, and increasing input sizes of 1, 64, and 128, resulting in output sizes of 64, 128, and 256. A dropout layer with a probability of 0.5 is incorporated after the convolutional layers to mitigate overfitting, followed by a fully connected layer that outputs a feature size of 128. The ModalityAttention module includes three linear layers with an input size of 128 for the first two layers and 64 for the third, transforming the features to facilitate multi-modality fusion. In the GraphLayer, two GCNConv layers are employed with output sizes of 64 and 7, respectively, to extract graph-based features. The final ModalClassifier comprises a fully connected layer with an output size of 2 to distinguish between different modalities.
The hyper-parameter settings for the model are summarized in Table 4. The batch size for both training and testing is set to 64, with optimization handled by the Adam optimizer. A learning rate of 0.0001 and a weight decay of $1 \times 10^{-4}$ are used. A step learning-rate scheduler is adopted for learning-rate decay, with step_size = 50 and gamma = 0.5. The number of epochs is fixed at 100. Additionally, the loss weights are set to $\lambda_{\mathrm{class}} = 1.0$ and $\lambda_{\mathrm{modality}} = 0.4$ to balance the loss functions of the classification and modality tasks. The window size for feature extraction is set to 1024, and the number of nearest neighbors n_neighbors is 5, with each file containing 400 samples.
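The effect of the reported scheduler settings can be made concrete with a short sketch. It assumes the scheduler behaves as a standard step decay, in which the learning rate is multiplied by gamma once every step_size epochs and the scheduler is stepped once per epoch; this is an assumption, since the text does not name the scheduler class. The function name is hypothetical.

```python
def step_lr(epoch, base_lr=1e-4, step_size=50, gamma=0.5):
    """Learning rate in effect during a 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate at the first epoch, just before and after the decay, and at the end.
for epoch in (0, 49, 50, 99):
    print(epoch, step_lr(epoch))
```

Under this assumption, the 100-epoch run uses a learning rate of 1e-4 for the first 50 epochs and 5e-5 for the remaining 50, so only a single decay step occurs during training.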
To avoid information leakage between training, validation, and testing, the data split was performed before graph construction, and each split was processed independently throughout the subsequent pipeline. Specifically, acoustic and vibration recordings belonging to different operating conditions were first assigned to the training, validation, and testing subsets. Signal segmentation was then conducted separately within each subset, and no segmented sample derived from one original recording was allowed to appear in more than one subset. In addition, the k-nearest-neighbor graph was constructed independently for each subset based only on the samples within that subset. Therefore, no test sample participated in the graph construction, feature propagation, or hyperparameter selection during training. All normalization operations were also performed separately after the data split, so that no statistical information from the validation or test sets was used in model training.
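The split-before-segmentation protocol described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the split fractions, the random seed, the recording identifiers, and the function names are hypothetical, and stratification by operating condition is omitted for brevity.

```python
import random


def split_recordings(recording_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Assign whole recordings to subsets before any segmentation."""
    ids = sorted(recording_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }


def segment_ids(split, segments_per_recording):
    """Enumerate (recording, segment index) samples inside each subset only."""
    return {name: [(rid, s) for rid in ids for s in range(segments_per_recording)]
            for name, ids in split.items()}


split = split_recordings([f"rec_{i:02d}" for i in range(10)])
samples = segment_ids(split, segments_per_recording=4)
print({name: len(items) for name, items in samples.items()})
```

Since every segment inherits the subset of its parent recording, no recording can contribute samples to more than one subset. A graph or normalization statistic built from one subset's samples then cannot see data from another.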
The number of neighbors k was selected based on validation performance. We evaluated k ∈ {3, 5, 7, 9, 11} and found that k = 5 achieved the best trade-off between local structural preservation and classification stability. A too-small k led to insufficient neighborhood information, whereas a too-large k introduced redundant inter-class connections. As shown in
Table 5, the model achieved the best validation/test performance at k = 5; therefore, this value was adopted in all subsequent experiments.
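A k-nearest-neighbor graph of the kind discussed above can be built as in the following minimal sketch. It assumes Euclidean distance between feature vectors and directed edges; the text does not state the metric or whether the graph is symmetrized, and the function name knn_edges is illustrative.

```python
import math

def knn_edges(features, k=5):
    """Return directed edges (i, j) linking each node i to its k nearest neighbors.

    Assumption: Euclidean distance; ties are broken by the lower node index.
    """
    edges = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        edges.extend((i, j) for _, j in dists[:k])
    return edges

# Hypothetical 1D features: three nearby nodes and one distant node
toy_features = [[0.0], [1.0], [2.0], [10.0]]
toy_edges = knn_edges(toy_features, k=1)
print(toy_edges)
```

Calling the function once per subset, on that subset's features only, matches the leakage-free protocol in which no test sample participates in the training graph.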
4. Diagnostic Analysis
4.1. Experiment Result
To evaluate the impact of each component within the proposed MTMAGNet model on overall performance, we conducted ablation experiments by individually removing or simplifying key modules and observed the changes in test accuracy over training iterations. The methods for each ablated module are described as follows: (a) to verify the effectiveness of the multimodal information fusion method, we first assessed the fault diagnosis performance using single-modal data, employing only the combination of vibration signals and the GCN model, denoted as Only Vibration with GCN (OV-GCN); (b) using only the combination of acoustic signals and the GCN model, denoted as Only Acoustic with GCN (OA-GCN); (c) fusing acoustic and vibration signals but removing the modal attention mechanism while retaining the multi-task learning module, denoted as MTL-GCN; (d) fusing acoustic and vibration signals but removing the multi-task learning module while retaining the modal attention mechanism, denoted as MMA-GCN; and (e) fusing acoustic and vibration signals but simultaneously removing both the modal attention mechanism and the multi-task learning module, retaining only the GCN model, referred to as VA-GCN. The results of the ablation experiments are shown in
Figure 5.
From
Figure 5, the following can be observed. The complete MTMAGNet model maintained a test accuracy close to 100% throughout the training process, outperforming all other configurations. The single-modal models OV-GCN and OA-GCN exhibited lower performance, especially OA-GCN, which showed significant fluctuations in accuracy and was notably inferior to the other models. The MMA-GCN and MTL-GCN models, which retained either the modal attention mechanism or the multi-task learning module, achieved clearly higher test accuracy than the single-modal models, reaching approximately 95%. This indicates that each module plays a crucial role in multimodal data processing and feature fusion. The VA-GCN model, which removed both modules, performed worse than the complete model, confirming the necessity of the modal attention mechanism and multi-task learning in enhancing classification accuracy. The ablation results thus demonstrate the contribution of each module: the synergy between the modal attention mechanism and multi-task learning enables the MTMAGNet model to perform strongly in multimodal fault diagnosis tasks.
To further verify the effectiveness of the MTMAGNet model under multi-modal fusion,
Table 6 summarizes the test results from eight ablation experiments. It reports the maximum test accuracy (Max-acc), minimum test accuracy (Min-acc), average test accuracy (Average-acc), and the standard error of the average accuracy for each model, providing quantitative evidence of the impact of each model component on overall performance. The complete MTMAGNet model performed best across all experiments, achieving a maximum test accuracy of 100%, a minimum of 99.11%, an average accuracy of 99.78%, and a standard error of only 0.31%. This demonstrates that the combination of multi-modal fusion, the modal attention mechanism, and multi-task learning enables the model to attain very high classification accuracy and stability. In contrast, the MMA-GCN model, which lacks the multi-task learning module, saw its average accuracy decrease to 96.55%, with the standard error increasing to 0.41%. This indicates that multi-task learning plays a significant role in enhancing feature extraction efficiency and the model’s generalization ability. The MTL-GCN model, which lacks the modal attention mechanism, achieved an average accuracy of 94.90% with a standard error of 0.79%, highlighting the crucial role of the modal attention mechanism in effectively fusing multi-modal features. The VA-GCN model, retaining only the GCN component, had an average accuracy of 94.35% and a standard error of 0.81%, clearly lower than the complete model. This underscores the necessity of both the modal attention mechanism and multi-task learning. The single-modal models, OV-GCN and OA-GCN, achieved average accuracies of 91.81% and 86.16%, with standard errors of 0.94% and 1.39%, respectively, performing the worst among all models. This further confirms that single-modal data cannot fully exploit the model’s potential, emphasizing the importance of multi-modal fusion.
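The four summary statistics discussed above can be computed from per-run accuracies as in the sketch below. The accuracy values are hypothetical placeholders, not the entries of Table 6, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of runs; the text does not state which definition was used.

```python
import math
import statistics

def summarize_runs(accuracies):
    """Max, min, mean, and standard error of the mean over repeated runs.

    Assumption: standard error = sample standard deviation / sqrt(n).
    """
    n = len(accuracies)
    return {
        "max": max(accuracies),
        "min": min(accuracies),
        "mean": statistics.fmean(accuracies),
        "sem": statistics.stdev(accuracies) / math.sqrt(n),
    }

# Hypothetical per-run test accuracies in percent (not taken from the paper)
example_runs = [100.0, 99.0, 100.0, 99.0]
summary = summarize_runs(example_runs)
print(summary)
```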
From the ablation results, it can be observed that introducing the auxiliary modality-classification task improves rather than degrades the performance of the full model. In particular, the complete MTMAGNet achieves higher average accuracy than the variant without multi-task learning, indicating that the auxiliary loss provides effective regularization rather than harmful negative transfer under the present setting.
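The role of the auxiliary task can be made concrete with a minimal sketch of a weighted two-task objective. It assumes that both tasks use a cross-entropy loss and that the two terms are combined linearly with the weights λ_class = 1.0 and λ_modality = 0.4; the function names and the example probabilities are illustrative.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -math.log(probs[target])

def total_loss(class_probs, class_target, modality_probs, modality_target,
               lambda_class=1.0, lambda_modality=0.4):
    """Weighted sum of the fault-classification and modality-classification losses.

    Assumption: a linear combination of two cross-entropy terms.
    """
    return (lambda_class * cross_entropy(class_probs, class_target)
            + lambda_modality * cross_entropy(modality_probs, modality_target))

# Hypothetical predictions: uniform over two classes for both tasks
example_loss = total_loss([0.5, 0.5], 0, [0.5, 0.5], 1)
print(example_loss)
```

In this form the auxiliary term contributes 40% of the weight of the main term, so it can shape the shared features without dominating the gradient.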
4.2. Model Performance Analysis
To comprehensively evaluate the sensitivity of the MTMAGNet model to hyperparameter configurations,
Figure 6 depicts the relationship between diagnostic accuracy and the hyperparameters Lambda Modality and Lambda Class in a three-dimensional surface plot. In this visualization, the X-axis and Y-axis correspond to Lambda Modality and Lambda Class values, respectively, while the Z-axis represents the average diagnostic accuracy achieved by the model. From the figure, it is evident that the model reaches peak diagnostic accuracy within the moderate hyperparameter range of approximately 0.4 to 0.6 for both Lambda Modality and Lambda Class. This region forms a plateau, indicating stable and consistently high model performance. Such optimal performance confirms that appropriately balanced task weights allow the model to exploit the complementary information from both tasks effectively. Conversely, when Lambda Modality and Lambda Class are set toward extreme values, a clear decrease in accuracy accompanied by greater fluctuation emerges. Particularly noticeable is the sharp performance drop when both hyperparameters approach higher values, suggesting that excessive emphasis on either modality attention or classification tasks can negatively impact the model’s generalization ability. This phenomenon likely results from an imbalance that induces overfitting to one particular task, limiting the model’s capacity to harness multi-task synergies.
To further substantiate the selection of the loss weights, a systematic grid search was conducted over Lambda Modality and Lambda Class. Specifically, Lambda Class was varied in {0.2, 0.4, 0.6, 0.8, 1.0}, and Lambda Modality was varied in {0.1, 0.2, 0.4, 0.6, 0.8}. For each parameter pair, the model was trained and evaluated under the same experimental protocol, and the corresponding average accuracy and Macro-F1 were recorded. The results are summarized in
Table 7. It can be observed that the performance varies systematically with the two loss-weight parameters, and the combination Lambda Modality = 0.4 and Lambda Class = 1.0 achieves the best overall trade-off among the tested settings. This result is consistent with the trend shown in
Figure 6, where the model exhibits the most stable and highest performance in the moderate parameter region. Therefore, these values were adopted in all subsequent experiments.
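The grid described above contains 25 parameter pairs, which can be enumerated as in the following sketch. The scoring function is a placeholder that peaks at the selected pair by construction; it does not reproduce the results of Table 7. In practice, the evaluate callback would train the model and return a validation metric.

```python
from itertools import product

LAMBDA_CLASS = [0.2, 0.4, 0.6, 0.8, 1.0]
LAMBDA_MODALITY = [0.1, 0.2, 0.4, 0.6, 0.8]

def grid_search(evaluate):
    """Score every (lambda_class, lambda_modality) pair and return the best one."""
    results = {
        (lc, lm): evaluate(lc, lm)
        for lc, lm in product(LAMBDA_CLASS, LAMBDA_MODALITY)
    }
    best = max(results, key=results.get)
    return best, results

def toy_score(lambda_class, lambda_modality):
    """Placeholder objective (an assumption), highest at (1.0, 0.4) by design."""
    return -((lambda_class - 1.0) ** 2 + (lambda_modality - 0.4) ** 2)

best_pair, all_results = grid_search(toy_score)
print(best_pair, len(all_results))
```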
In order to further analyze the model’s attention allocation to acoustic and vibration features across different fault categories, the mean attention weights extracted by the MTMAGNet model on the test set are depicted in
Figure 7. In this figure, the horizontal axis represents different fault categories (Class 0 to Class 6), and the vertical axis denotes the mean attention weight. The yellow bars represent the mean attention weights allocated to acoustic signals, while the orange bars illustrate those for vibration signals, with the error bars indicating standard deviations. The results clearly reveal varying patterns of attention allocation between the acoustic and vibration modalities across different fault categories. For instance, in Class 0 and Class 2, attention weights for acoustic and vibration signals are relatively balanced, indicating that the model equally considers both modalities when extracting discriminative features. In contrast, for Class 1, Class 3, Class 5, and Class 6, attention weights assigned to vibration signals are significantly higher than those assigned to acoustic signals, suggesting a stronger reliance on vibration-based features by the model in these fault categories. Specifically, Class 3 exhibits the greatest disparity, highlighting vibration signals as particularly informative for this fault type. For Class 4, although the attention allocated to vibration signals remains predominant, attention given to acoustic signals is relatively elevated, reflecting a more nuanced reliance on both modal features.
Overall, these findings confirm that the employed attention mechanism effectively guides the MTMAGNet model to adaptively prioritize modality-specific information depending on fault characteristics. By dynamically adjusting attention weights, the model achieves optimized multimodal feature integration, thereby significantly enhancing its fault classification performance. This adaptive capability underscores the effectiveness and robustness of our proposed multimodal fusion approach in fault diagnosis tasks.
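The adaptive weighting described above can be illustrated with a simplified fusion step. This sketch assumes that each modality receives one scalar score, that the scores are normalized with a softmax, and that the fused feature is the weighted sum of the two modality features. In the actual model the scores would come from the ModalityAttention layers; the exact formulation may differ from this sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(acoustic, vibration, score_acoustic, score_vibration):
    """Weighted sum of two equally sized feature vectors.

    Assumption: one scalar attention weight per modality.
    """
    w_a, w_v = softmax([score_acoustic, score_vibration])
    fused = [w_a * a + w_v * v for a, v in zip(acoustic, vibration)]
    return fused, (w_a, w_v)

# Hypothetical features and scores: equal scores give equal weights
fused_equal, weights_equal = fuse([1.0, 2.0], [3.0, 4.0], 0.0, 0.0)
# A higher vibration score shifts the weight toward the vibration features
fused_vib, weights_vib = fuse([1.0, 2.0], [3.0, 4.0], 0.0, math.log(3))
print(weights_equal, weights_vib)
```

Averaging such weights per fault class over the test set yields the kind of per-class bar chart discussed for the attention analysis.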
To assess the classification accuracy of the MTMAGNet model under varying noise levels, noise with different standard deviations was added to the original acoustic and vibration signals; the results are shown in
Figure 8. The X-axis represents the noise standard deviation, while the Y-axis shows the corresponding classification accuracy of the model. The results reveal that the MTMAGNet model maintains a high accuracy, close to 100%, when the noise standard deviation is low (0 to 0.5). As the noise intensity increases, the model’s accuracy begins to decline, with a gradual drop observed as the noise standard deviation approaches 1.0. Beyond this threshold, the degradation becomes pronounced, with accuracy falling below 90% when the noise standard deviation exceeds 2.0. These results indicate that MTMAGNet exhibits strong robustness under low-to-moderate noise levels, whereas its performance degrades noticeably under severe noise. Therefore, the robustness of the proposed method should be understood as conditional rather than uniform across all noise intensities.
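A noise-injection step of this kind can be sketched as follows. It assumes additive zero-mean Gaussian noise applied independently to each sample; the text specifies only the standard deviation, so the distribution and the function name are assumptions.

```python
import random

def add_gaussian_noise(signal, std, seed=None):
    """Return a copy of the signal with zero-mean Gaussian noise added.

    Assumption: additive white Gaussian noise with the given standard deviation.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, std) for x in signal]

# Hypothetical clean segment of 1024 zeros, matching the window size in the text
clean = [0.0] * 1024
noisy_by_level = {
    std: add_gaussian_noise(clean, std, seed=0)
    for std in (0.0, 0.2, 0.5, 1.0, 2.0)
}
print({std: max(abs(v) for v in seg) for std, seg in noisy_by_level.items()})
```

Evaluating a trained model on segments perturbed at each level then yields an accuracy-versus-noise curve of the type analyzed above.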
The feature embedding space of the Graph Convolutional Network (GCN) at different training stages of the proposed model is visualized in
Figure 9. At the first epoch, the node features are relatively dispersed, with nodes from different categories intermingling and no clear class boundaries. As training progresses, the nodes gradually form clusters, the category boundaries become more distinct, and the relationships between nodes strengthen. At epochs 5 and 10, some categories begin to separate and form initial clusters, indicating that the model has started to identify and differentiate feature patterns. By epoch 15, the distinctions between categories become more pronounced and the clustering of nodes improves significantly, indicating that the model has effectively partitioned the feature space. By epoch 20, nodes from different categories form distinct and dense clusters with clear boundaries, demonstrating that the model has achieved strong classification capability in the embedding space.
4.3. Comparison with Other Existing Methods
To ensure a fair comparison, all comparative models were trained and evaluated under the same experimental protocol. Specifically, the same train/validation/test splits, signal segmentation strategy, normalization procedure, batch size, optimizer type, training epochs, and evaluation metrics were used for all models. For the deep learning baselines, hyperparameters were selected under the same validation protocol, and no model had access to test-set information during training or model selection. Moreover, for the compared multimodal deep-learning baselines, the same or equivalent 1D-CNN feature-extraction backbone was used whenever applicable, so that performance differences were mainly attributable to the subsequent fusion, sequence modeling, or graph-learning modules rather than to unequal front-end feature-extraction capacity.
We compare the bearing fault classification accuracy of MTMAGNet with five different models to evaluate its performance in intelligent diagnosis. The selected comparison models are: (1) Adaptive Graph Convolutional Network (AGCN [
39]), which dynamically adjusts the graph structure based on data to handle non-static relationships, allowing us to compare AGCN’s adaptability with the multi-task learning and attention mechanisms in MTMAGNet; (2) Dual Graph Convolutional Network (Dual-GCN [
40]), which processes data through two independent graph structures to capture different relationships, serving to evaluate the multi-task learning in MTMAGNet; (3) Graph Attention Network (GAT [
41]), which introduces node-level attention mechanisms to focus on important neighboring nodes, enabling an assessment of modal attention versus node-level attention; (4) Convolutional Neural Network (CNN), where features from acoustic and vibration signals are separately extracted, concatenated, and classified through a fully connected layer to verify MTMAGNet’s superiority over traditional CNNs; and (5) Long Short-Term Memory Network (LSTM), which extracts sequential features suitable for capturing dependencies in acoustic and vibration signals, for comparison with MTMAGNet.
The fault classification accuracies of MTMAGNet and the five other methods across eight trials are shown in
Figure 10. The results indicate that MTMAGNet consistently achieved the highest classification accuracy, approaching or reaching 100%, outperforming all the other methods. Specifically, AGCN and Dual-GCN exhibited certain advantages in adjusting graph structures and facilitating multi-task learning; however, their accuracies were still lower than MTMAGNet, underscoring the effectiveness of the synergistic interaction between multimodal attention mechanisms and multi-task learning in the proposed approach. Compared to GAT, MTMAGNet demonstrated more precise attention allocation at the modality level, enhancing the utilization of features across different modalities. Furthermore, in contrast to traditional CNN and LSTM, MTMAGNet captured acoustic and vibration signal features and their interrelationships more effectively, resulting in higher diagnostic accuracy. In summary, the experimental results validate the superior performance of MTMAGNet in intelligent fault diagnosis, particularly in the areas of multimodal data fusion and optimized attention mechanisms.
Figure 11 summarizes the average classification accuracy of each method, with error bars indicating the accuracy fluctuation range. The results show that the MTMAGNet model achieves the highest average accuracy, approaching 100%, with the smallest error range, demonstrating exceptional stability and robustness. In comparison, AGCN, GAT, and Dual-GCN have slightly lower accuracies, indicating that the multi-modality attention mechanism and multi-task learning in MTMAGNet effectively enhance fault classification performance. Traditional methods like CNN and LSTM perform worse, further validating the superiority of MTMAGNet in intelligent diagnostic tasks.
The confusion matrices for six methods are shown in
Figure 12. The MTMAGNet model achieves nearly perfect classification across all fault classes, with no significant misclassifications, highlighting its robustness and accuracy in fault diagnosis. AGCN and Dual-GCN perform relatively well but exhibit occasional misclassifications, particularly in classes 3 and 5. GAT demonstrates competitive performance but shows slightly lower accuracy in class 5. CNN and LSTM, as conventional deep learning models, exhibit the highest misclassification rates, especially in classes 2, 3, and 5, indicating limitations in handling complex fault patterns. Overall, the MTMAGNet model surpasses the other models in classification accuracy, demonstrating its effectiveness and reliability for intelligent fault diagnosis.
The classification accuracy of the six methods under different noise levels is reported in
Table 8. The results show that MTMAGNet achieves the highest accuracy across all noise levels, especially under low noise conditions (0.2 and 0.5), with accuracy approaching 100%, demonstrating its excellent noise resistance. As the noise level increases, the accuracy of MTMAGNet decreases slightly but remains superior to that of the other models, maintaining an accuracy of 91.61% even at a noise level of 2. In contrast, Dual-GCN and AGCN exhibit relatively good noise resistance but have significantly lower accuracy than MTMAGNet under high noise conditions (noise level 2). GAT and CNN perform reasonably well at low and moderate noise levels (0.5 and 1), but their accuracy declines markedly under higher noise conditions. LSTM performs the worst across all noise levels, with accuracy dropping to 65.12% at high noise levels. The experimental results further indicate that the MTMAGNet model has high robustness and accuracy under various noise conditions.
To evaluate the classification accuracy of the proposed algorithm under small-sample conditions, we investigated its feature processing capability with different training sample sizes. The proportion of the training dataset used was set to 10%, 30%, and 50%.
Table 9 presents the experimental comparison results of the six methods under different sample scales. MTMAGNet consistently outperforms the other methods in classification performance across the tested training sample proportions, demonstrating strong small-sample handling capability. Its average accuracy remains at a high level and shows the least sensitivity to sample size variations. Graph neural network-based methods such as Dual-GCN and AGCN rank second, exhibiting strong stability and relatively high classification accuracy. In contrast, conventional methods such as CNN and LSTM perform poorly in small-sample scenarios, particularly LSTM, which shows limited improvement in performance as the sample size increases.
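Reduced training sets of this kind can be drawn as in the sketch below. It assumes class-stratified random subsampling so that every fault class stays represented at each proportion; the text does not state the sampling strategy, and the labels used here are hypothetical.

```python
import random
from collections import defaultdict

def subsample_per_class(labels, fraction, seed=0):
    """Return sorted indices keeping about `fraction` of each class (at least one).

    Assumption: stratified sampling without replacement.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    kept = []
    for label in sorted(by_class):
        indices = by_class[label]
        n_keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)

# Hypothetical training labels: 7 classes with 100 samples each
train_labels = [c for c in range(7) for _ in range(100)]
subset_sizes = {
    frac: len(subsample_per_class(train_labels, frac))
    for frac in (0.1, 0.3, 0.5)
}
print(subset_sizes)
```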
Although the above results verify the superiority of MTMAGNet under the conventional experimental setting, further evaluation under different rotational-speed conditions is still necessary to assess its generalization ability in practical applications.
4.4. Generalization Analysis Under Unseen Rotational Speeds
To further verify the robustness and generalization capability of the proposed MTMAGNet under varying operating conditions, additional experiments were conducted at three different rotational speeds, namely 1800 rpm, 2400 rpm, and 3000 rpm. On this basis, two evaluation settings were considered. The first is the same-speed setting, in which the training and testing data were collected under the same rotational speed. The second is the cross-speed setting, in which the training and testing data were collected under different rotational speeds. Compared with the conventional evaluation protocol, the cross-speed setting is more challenging and can better reflect the practical application scenario in which the operating speed of the equipment may vary during deployment.
Table 10 presents the same-speed diagnostic performance of different methods at 1800 rpm, 2400 rpm, and 3000 rpm. It can be observed that all multimodal models achieve relatively high classification accuracy under the same-speed setting. Among them, MTMAGNet consistently exhibits the best performance at all three speeds, reaching 99.82% at 1800 rpm, 99.29% at 2400 rpm, and 100.00% at 3000 rpm, with an average accuracy of 99.70%. In contrast, CNNTransformer achieves an average accuracy of 96.61%, LateFusionCNN achieves 95.54%, and CNNBiLSTM achieves 91.79%. These results indicate that the proposed multimodal attention and graph-based learning framework maintains highly stable diagnostic capability under different fixed-speed conditions and outperforms the other compared multimodal baselines in terms of both accuracy and consistency.
From
Figure 13, it is evident that MTMAGNet achieves the best confusion-matrix distribution among all compared methods. Its prediction results are concentrated almost entirely on the main diagonal, indicating excellent classification consistency. In contrast, the other methods show more obvious off-diagonal errors, especially CNNBiLSTM and CNNTransformer, which exhibit substantial confusion among several fault categories. The dominant confusion in MTMAGNet only occurs between Classes C2 and C3, whereas the competing methods show broader and more severe misclassification patterns. These observations demonstrate that the proposed MTMAGNet can more effectively preserve inter-class separability and reduce confusion between fault states with similar signal characteristics.
To further evaluate the model under conditions different from those used for training, cross-speed experiments were conducted, and the corresponding results are summarized in
Table 11. In this setting, the model was trained using data from two rotational speeds and tested on the remaining unseen rotational speed. The results show that the classification performance of all methods decreases compared with the same-speed setting, which confirms that changes in rotational speed introduce a clear distribution shift and increase the difficulty of fault diagnosis. Nevertheless, MTMAGNet still achieves the best performance under all three cross-speed settings, obtaining 74.61%, 67.46%, and 75.36%, respectively, with an average accuracy of 72.48%. By comparison, CNNTransformer achieves an average accuracy of 57.62%, LateFusionCNN achieves 57.24%, and CNNBiLSTM achieves 53.56%. Therefore, although the unseen-speed setting is significantly more challenging than the same-speed setting, MTMAGNet still preserves the highest classification accuracy among all compared methods, which demonstrates its stronger generalization ability.
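The cross-speed protocol amounts to a leave-one-speed-out scheme, which can be sketched as follows together with a check of the averages quoted in the text. The function name is illustrative, and the accuracy lists contain only the MTMAGNet values reported in this section.

```python
SPEEDS_RPM = [1800, 2400, 3000]

def leave_one_speed_out(speeds):
    """Return (training speeds, held-out test speed) for every speed."""
    return [([s for s in speeds if s != held_out], held_out) for held_out in speeds]

folds = leave_one_speed_out(SPEEDS_RPM)
for train_speeds, test_speed in folds:
    print(f"train on {train_speeds} rpm, test on {test_speed} rpm")

# MTMAGNet accuracies quoted in the text (percent)
same_speed_acc = [99.82, 99.29, 100.00]
cross_speed_acc = [74.61, 67.46, 75.36]

same_speed_avg = round(sum(same_speed_acc) / len(same_speed_acc), 2)
cross_speed_avg = round(sum(cross_speed_acc) / len(cross_speed_acc), 2)
print(same_speed_avg, cross_speed_avg)
```

The gap between the two averages, roughly 27 percentage points, quantifies the distribution shift introduced by an unseen rotational speed.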
From the detailed cross-speed results, it can also be found that testing on 2400 rpm is the most difficult setting for all methods. Under this condition, MTMAGNet still achieves 67.46% accuracy, whereas CNNTransformer, LateFusionCNN, and CNNBiLSTM achieve only 38.61%, 52.54%, and 40.79%, respectively. This result suggests that the feature distribution at 2400 rpm differs more substantially from that learned from the other two speeds, leading to more severe domain shift. Even under this challenging condition, the proposed MTMAGNet retains a clear performance advantage, which can be attributed to the synergistic effect of modality attention, graph-based relational modeling, and auxiliary modality supervision.
To provide a more detailed class-level interpretation,
Table 12 reports the class-wise precision, recall, and F1-score of different methods under the representative 3000 rpm setting. Among all compared models, MTMAGNet achieves the most balanced class-level performance, with perfect precision, recall, and F1-score of 1.0000 for Classes C0, C1, C3, C4, C5, and C6, and a near-perfect F1-score of 0.9937 for Class C2. This indicates that the proposed method can stably identify both normal and faulty states under this condition. LateFusionCNN also performs well overall, but its recognition of Class C5 and Class C6 is slightly weaker, with F1-scores of 0.9809 and 0.9816, respectively. CNNBiLSTM shows more pronounced degradation in several classes, especially Class C2, for which the F1-score drops to 0.8944. CNNTransformer performs strongly on most categories, but its performance on Class C1 is clearly lower, with an F1-score of 0.8414, indicating limited robustness for this fault type.
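Class-wise precision, recall, and F1-score follow directly from a confusion matrix, as in the sketch below. The three-class matrix is a hypothetical example rather than data from Table 12, and rows are assumed to index the true class.

```python
def class_metrics(confusion):
    """Per-class (precision, recall, F1) from a square confusion matrix.

    Assumption: confusion[i][j] counts class-i samples predicted as class j.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(n)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics

# Hypothetical 3-class confusion matrix (not taken from the paper)
toy_confusion = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]
toy_metrics = class_metrics(toy_confusion)
macro_f1 = sum(f1 for _, _, f1 in toy_metrics) / len(toy_metrics)
print(toy_metrics, macro_f1)
```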
The most confused class pairs are summarized in
Table 13. For MTMAGNet, the dominant confusion occurs between Class C2 and Class C3, with only two samples of Class C2 being misclassified as Class C3 and one sample of Class C3 being misclassified as Class C2, which indicates that the overall confusion level of the proposed method remains very low. By contrast, the competing methods exhibit more severe class confusion. For example, in CNNTransformer, Class C1 is frequently misclassified as Class C2, with 13 confused samples, and as Class C6, with 6 confused samples. In CNNBiLSTM, the most prominent confusion occurs between Class C5 and Class C2, with 6 misclassified samples. These results suggest that the competing methods are more likely to confuse fault categories with similar impulsive characteristics, whereas MTMAGNet can better preserve inter-class separability.
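The most confused class pairs can be extracted by ranking the off-diagonal entries of a confusion matrix. The sketch below uses a hypothetical matrix and assumes that rows index the true class; it does not reproduce the counts in Table 13.

```python
def top_confusions(confusion, top_n=3):
    """Return up to top_n (true, predicted, count) triples with the largest
    off-diagonal counts, ignoring empty cells."""
    pairs = [
        (count, true, pred)
        for true, row in enumerate(confusion)
        for pred, count in enumerate(row)
        if true != pred and count > 0
    ]
    # Largest count first; ties resolved by class indices for determinism
    pairs.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(true, pred, count) for count, true, pred in pairs[:top_n]]

# Hypothetical 3-class confusion matrix (not taken from the paper)
example_confusion = [
    [10, 0, 0],
    [0, 8, 2],
    [0, 1, 9],
]
confused_pairs = top_confusions(example_confusion)
print(confused_pairs)
```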
Overall, the above results demonstrate that MTMAGNet not only achieves the highest accuracy under the same-speed setting, but also maintains the best diagnostic performance under the more challenging cross-speed setting. Meanwhile, the class-wise analysis and confusion-pair analysis further confirm that the proposed method provides more balanced and reliable discrimination among different bearing states. Therefore, the proposed MTMAGNet exhibits both superior classification performance and stronger generalization capability than the other multimodal baseline methods.
5. Conclusions
This paper proposed a Multi-Task Multimodal Attention Graph Convolutional Network (MTMAGNet) for acoustic–vibration fusion-based rolling bearing fault diagnosis. By integrating modality-level attention, graph-based relational learning, and an auxiliary modality-classification task, the proposed method effectively exploits complementary multimodal information and improves fault-representation learning. Experimental results demonstrated that MTMAGNet outperformed the compared methods under multiple evaluation settings. In the same-speed experiments, it achieved the best performance at 1800 rpm, 2400 rpm, and 3000 rpm, with an average accuracy of 99.70%. Under the more challenging cross-speed setting, MTMAGNet still obtained the highest average accuracy of 72.48%, indicating stronger generalization ability than the compared multimodal baselines. The ablation, class-wise, and confusion-matrix results further verified the effectiveness of modality attention, graph learning, and auxiliary modality supervision in improving diagnostic performance.
The noise analysis showed that MTMAGNet is robust under low-to-moderate noise levels, but its performance degrades under severe noise. In addition, although the sensitivity analysis indicated that k = 5 provided the best trade-off under the current setting, the proposed framework still depends on the graph-construction strategy. Compared with simpler models, MTMAGNet also introduces additional computational cost due to multimodal attention, dynamic graph construction, and multi-task optimization. Future work will therefore focus on more adaptive graph construction, more efficient lightweight implementation, and broader validation on other rotating machinery and fault types.