A Multi-Task Multimodal Attention Graph Convolutional Network for Acoustic–Vibration Fusion-Based Rolling Bearing Fault Diagnosis

Wang, Tong; Tang, Yuanyuan; He, Yibo; Li, Yinghao

doi:10.3390/app16094310


¹ Engineering Training Center, Shenyang University of Technology, Shenyang 110870, China

² School of Mechanical Engineering, Shenyang University of Technology, Shenyang 110870, China

³ School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4310; https://doi.org/10.3390/app16094310

Submission received: 26 March 2026 / Revised: 17 April 2026 / Accepted: 24 April 2026 / Published: 28 April 2026


Abstract

Single-sensor-based fault diagnosis of rolling bearings often suffers from noise sensitivity, installation-dependent performance, and incomplete fault characterization. To address these limitations, this paper proposes a multi-task multimodal attention graph convolutional network (MTMAGNet) that integrates acoustic and vibration signals for bearing fault diagnosis. First, one-dimensional convolutional neural networks are used to extract modality-specific features. These features are then fused through a multi-modal attention mechanism to exploit the complementary information contained in the two signal sources. Based on the fused representations, a dynamic k-nearest neighbor graph is constructed to model relationships among samples, and a graph convolutional network is employed to learn discriminative structural features. Moreover, a multi-task learning scheme is introduced, in which fault classification serves as the primary task and modal classification is used as an auxiliary task to enhance feature learning and improve model generalization. Experimental results on a self-built acoustic–vibration test bench collected under three rotational speeds (1800 rpm, 2400 rpm, and 3000 rpm) demonstrate that the proposed method achieves high diagnostic accuracy and strong generalization performance under different fault conditions.

Keywords: fault diagnosis; acoustic–vibration fusion; multimodal attention; multi-task learning; graph convolutional network

1. Introduction

Rolling element bearings are critical transmission components in modern electromechanical systems, and their health condition directly affects equipment reliability, operational safety, and maintenance cost [1,2]. However, because of prolonged operation, variable loads, and harsh industrial environments, bearings are highly susceptible to faults such as wear, fatigue, and spalling. Once such faults develop, they may cause performance degradation, machine downtime, and even serious safety accidents. Therefore, timely and accurate bearing fault diagnosis remains an important topic in condition monitoring and intelligent maintenance.

Traditional rolling bearing fault diagnosis methods mainly rely on signal analysis in the time domain, frequency domain, and time–frequency domain [3,4,5,6]. Time-domain methods usually extract statistical indicators, such as root mean square, kurtosis, skewness, crest factor, and impulse factor, to characterize fault-related impulsiveness and amplitude variation [3]. Frequency-domain approaches focus on characteristic fault frequencies and their harmonics through spectral analysis, envelope demodulation, or spectral kurtosis [4]. Time–frequency methods, such as short-time Fourier transform, wavelet transform, empirical mode decomposition, and variational mode decomposition, are particularly suitable for nonlinear and non-stationary bearing signals [5,6]. Although these methods have achieved useful results, they often depend on handcrafted features and expert knowledge, and their performance may degrade under strong noise, sensor variability, and complex operating conditions.

Among the available sensing modalities, vibration signals are the most widely used in bearing diagnosis because they directly reflect the dynamic responses of mechanical structures [7,8,9,10,11,12]. For example, Wang et al. [7] combined improved multiscale weighted permutation entropy with an optimized support vector machine to improve diagnosis robustness under complex conditions. Ni et al. [9] proposed a fault-information-guided variational mode decomposition method to extract intrinsic mode functions associated with bearing defects more effectively. In the deep-learning domain, Zhao et al. [11] developed a variable-speed diagnosis framework based on a deep branched attention network and multiscale entropy, while Li et al. [12] introduced a semi-supervised graph convolutional network using images transformed from vibration signals. These studies demonstrate the effectiveness of vibration-based diagnosis, but vibration measurement usually requires contact sensors and may be affected by installation position and environmental interference.

In contrast, acoustic signals provide several attractive advantages, including non-contact measurement, flexible deployment, and sensitivity to certain early-stage fault signatures [13,14,15,16,17,18]. Shiri et al. [15] used acoustic sensing in a robotic platform to monitor conveyor belt rollers and improved the safety and effectiveness of fault detection. Hou et al. [16] investigated acoustic-emission-based diagnosis for high-speed train wheelset bearings and established a mathematical model considering elastic distortion and signal attenuation. Sun et al. [17] employed acoustic diagnosis with multiscale fractional permutation entropy in railway turnout machines, and Tang et al. [18] achieved high-accuracy gearbox fault diagnosis through joint use of thermal imaging and acoustic information. Nevertheless, acoustic signals are more vulnerable to environmental noise and often have a lower signal-to-noise ratio than vibration signals. For this reason, relying on a single sensing modality may not be sufficient to fully describe complex bearing health states.

To overcome the limitations of single-sensor diagnosis, multimodal fusion of acoustic and vibration signals has attracted increasing attention [19,20,21,22,23,24,25,26,27,28,29,30]. Because these two modalities are complementary in physical meaning, frequency sensitivity, and acquisition characteristics, their fusion can provide a more comprehensive representation of bearing fault information [19,20]. Yan et al. [21] proposed a fusion method based on a secondary convolutional neural network and reported improved diagnosis accuracy. Tang et al. [22] showed that the combination of vibration and acoustic-emission signals can more comprehensively reflect bearing operating conditions. Buchaiah et al. [23] introduced a feature extraction and selection framework for multimodal data fusion to improve predictive accuracy and stability. Saufi et al. [24] fused raw signals from different sensors through a DE-1D-CNN. Guo et al. [25] designed a fault diagnosis method combining feature extraction with a bag-of-words model. Pacheco-Cherrez et al. [26] provided a versatile approach to enhancing predictive maintenance of rotating machinery using vibration and acoustic signals. Chu et al. [27] and Liu et al. [28] further extended multimodal diagnosis toward cross-domain and domain-adaptive settings. Li et al. [29] and Li et al. [30] also improved classification accuracy by effectively integrating acoustic and vibration signals. Despite this progress, many existing fusion methods still rely on direct feature concatenation, manually designed fusion strategies, or modality-agnostic integration, which may not fully exploit modality complementarity at the representation-learning level.

In parallel, graph neural networks (GNNs) have shown strong capability in modeling non-Euclidean relationships and high-order dependencies between samples [31,32,33,34,35,36]. By representing each sample as a node and its similarity relations as edges, GNNs can capture structural information that is difficult to exploit using conventional sample-independent networks. Liu et al. [32] introduced a heterogeneous GNN model to capture complex relational patterns, while Yong et al. [33] proposed a bearing diagnosis method combining dual-path CNNs and multiple graph convolutional networks. Liu et al. [34] developed a dynamic temporal GNN for multivariate time-series classification, and Kavianpour et al. [35] and Ghorvei et al. [36] further demonstrated the value of graph-based representation learning under domain adaptation and varying working conditions. These studies indicate that graph learning can enhance diagnosis performance by exploiting inter-sample structural relations. However, in multimodal bearing diagnosis, two challenges remain insufficiently addressed: first, how to adaptively weight different sensing modalities before fusion; and second, how to preserve modality-discriminative characteristics while performing graph-based relational learning. Ait Ichou et al. [37] proposed a deep multimodal learning framework for heart sound classification by integrating CNNs, Transformers, and BiLSTM with attention, showing that multimodal feature fusion and hybrid sequence modeling can significantly improve classification robustness in noisy and non-stationary acoustic signals.

To better position this study with respect to prior acoustic–vibration fusion and graph-based diagnosis methods, the novelty of the proposed MTMAGNet does not lie in the isolated use of CNNs, attention, GCNs, or multi-task learning, since each of these components has already been established in previous studies. Instead, the key novelty lies in how these components are organized into a unified multimodal relational-learning framework. Specifically, unlike existing acoustic–vibration fusion approaches that mainly rely on direct concatenation or decision-level fusion, the proposed method introduces a modality-level attention mechanism to adaptively reweight acoustic and vibration embeddings before fusion. Unlike conventional graph-based diagnosis methods that usually construct graphs from single-modal features, MTMAGNet performs graph learning on the fused multimodal representation, so that inter-sample relations are established after cross-modal interaction. In addition, an auxiliary modality-classification task is introduced to preserve modality-discriminative characteristics during feature learning, thereby improving the quality of the fused representation and enhancing generalization.

Accordingly, this paper proposes a Multi-Task Multimodal Attention Graph Convolutional Network (MTMAGNet) for acoustic–vibration fusion-based rolling bearing fault diagnosis. The main contributions of this work are summarized as follows:

(1): A modality-attention-based acoustic–vibration fusion strategy is proposed to adaptively exploit the complementary information of the two sensing modalities;
(2): A k-nearest-neighbor graph is constructed on the fused multimodal embeddings so that inter-sample structural relations can be learned by a graph convolutional network;
(3): A multi-task learning scheme is introduced, in which fault classification is treated as the main task and modality classification is used as an auxiliary task to regularize representation learning and improve generalization.

The remainder of this paper is organized as follows: Section 2 introduces related work, covering theories on multimodal fusion, graph neural networks, and multi-task learning, and details the proposed method, including its structure, algorithm implementation, and technical specifics. Section 3 presents the experimental details. Section 4 provides experimental validation and result analysis, demonstrating the effectiveness and superiority of the proposed method compared to existing approaches. Section 5 summarizes the main contributions and discusses the method’s limitations and potential future research directions.

2. Proposed Method

This section provides a detailed description of the MTMAGNet method, which serves as the core component of the proposed multi-modality diagnostic framework. Following this, the internal mechanism of the modality attention is explained, with a focus on how adaptive weights are assigned to acoustic and vibration signals. The specific implementation of the MTMAGNet model, including graph convolutional layers, feature fusion strategy, and optimization approach, is also presented.

2.1. Feature Extraction from Acoustic and Vibration Data

Assume that the acoustic and vibration signals are discrete time series with a sampling frequency of f_s. The acoustic and vibration signal values at each discrete time point n are denoted as X_a[n] and X_v[n], respectively, where n = 1, 2, …, N and N is the total number of sampling points. The discrete acoustic signal can be represented as:

X_{a} = \{X_{a}[1], X_{a}[2], \dots, X_{a}[N]\}

(1)

Similarly, the discrete vibration signal can be represented as:

X_{v} = \{X_{v}[1], X_{v}[2], \dots, X_{v}[N]\}

(2)

In these equations, X_{a}[n] \in \mathbb{R} represents the acoustic signal value at the n-th synchronized time point, and X_{v}[n] \in \mathbb{R} represents the vibration signal value at the same time point.

The core of feature extraction lies in transforming the input time series signals X_a and X_v into corresponding low-dimensional feature vectors h_a and h_v. In this paper, we use convolutional neural networks (CNNs) to extract local features while reducing dimensionality. The convolution operation is a key module for capturing local patterns from input signals. The convolution can be expressed mathematically as follows:

h^{(l)}(t) = \sigma\left(\sum_{i=1}^{k} \omega_{i}^{(l)} \cdot X(t + i - 1) + b^{(l)}\right)

(3)

where h^{(l)}(t) is the output of the l-th convolutional layer at time t, \omega_{i}^{(l)} (i = 1, …, k) are the weights of a convolution kernel of size k, i.e., the length of the temporal window covered by the kernel, b^{(l)} is the bias term, and \sigma(\cdot) is the activation function. By using this one-dimensional convolution, we can capture local temporal relationships and dependencies from the time-domain signal.
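As an illustration of Equation (3), the following minimal sketch evaluates a single-channel, single-kernel one-dimensional convolution without padding. The ReLU activation, the kernel values, and the helper names `conv1d` and `relu` are assumptions made for this example only; the paper does not specify them at this point.

```python
def relu(v):
    """Illustrative activation; the activation actually used is not specified here."""
    return max(0.0, v)


def conv1d(x, weights, bias, activation=relu):
    """Evaluate h(t) = activation(sum_i w_i * x(t + i - 1) + b) for every valid t.

    No padding is applied, so the output has len(x) - k + 1 entries,
    where k = len(weights) is the kernel size.
    """
    k = len(weights)
    out = []
    for t in range(len(x) - k + 1):
        s = sum(weights[i] * x[t + i] for i in range(k)) + bias
        out.append(activation(s))
    return out


# Toy signal and a hypothetical difference kernel of size k = 3.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]
h = conv1d(signal, weights=[1.0, 0.0, -1.0], bias=0.0)
print(h)  # [0.0, 2.0, 0.0, 0.0]
```

The kernel responds only where the signal falls across the window, which is the kind of local temporal pattern the convolutional layers are meant to capture.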

Pooling layers are introduced to reduce the temporal dimensionality while preserving key features. A common pooling strategy is max pooling, whose mathematical formula is:

h_{pool}^{(l)}(t) = \max\left(h^{(l)}(t : t + s - 1)\right)

(4)

where h_{pool}^{(l)}(t) represents the pooled features, and s is the size of the pooling window. By applying pooling, we can reduce the temporal dimensionality while retaining the most salient local features, which is beneficial for improving the robustness of the extracted features.
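The pooling rule of Equation (4) can be sketched as follows. The stride is an assumption: it defaults to the window size s (non-overlapping windows), which is one common way to obtain the dimensionality reduction described above. The function name `max_pool1d` is hypothetical.

```python
def max_pool1d(h, size, stride=None):
    """Take the maximum of h over windows h[t : t + size].

    The stride is assumed to equal the window size unless given,
    so trailing samples that do not fill a window are dropped.
    """
    stride = size if stride is None else stride
    return [max(h[t:t + size]) for t in range(0, len(h) - size + 1, stride)]


features = [0.1, 0.9, 0.3, 0.4, 0.8, 0.2, 0.5]
pooled = max_pool1d(features, size=2)
print(pooled)  # [0.9, 0.4, 0.8]
```

With s = 2 the seven input values are reduced to three, and each retained value is the strongest response in its window.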

Specifically, acoustic and vibration signals are processed through multiple layers of one-dimensional convolutions to extract local patterns, followed by down-sampling through pooling layers. The final output is then projected into low-dimensional feature spaces through fully connected layers. The entire feature extraction process can be mathematically described as:

h_{a} = f(X_{a}) = \mathrm{Pool}(\sigma(\mathrm{Conv}(X_{a}))) \in \mathbb{R}^{d}

(5)

h_{v} = f(X_{v}) = \mathrm{Pool}(\sigma(\mathrm{Conv}(X_{v}))) \in \mathbb{R}^{d}

(6)

where Conv(·) denotes the convolution operation, σ(·) is the activation function, and Pool(·) represents the pooling operation. The output is a d-dimensional feature vector.

Through convolution operations, the model can capture local signal patterns, such as peaks and periodic characteristics, which helps the model to reduce the impact of noise. Pooling layers help reduce the data dimensionality, avoiding overfitting while preserving key information. By passing through multiple convolutional and pooling layers, the raw time-series signals are compressed into low-dimensional feature vectors, which can be more effectively fused in the subsequent multi-modality integration and classification tasks. Although the acoustic and vibration branches adopt the same feature-extraction architecture, their parameters are not shared. This design is necessary because the acoustic branch takes a 16-channel input, whereas the vibration branch takes a single-channel input. Therefore, the two branches have the same layer configuration but are independently parameterized to better adapt to the different statistical characteristics of the two sensing modalities.

2.2. Modality Attention Mechanism

The modality attention mechanism calculates attention scores s_a and s_v for the acoustic and vibration modalities, respectively, to reflect the relative importance of each modality. The attention scores are computed using a nonlinear activation function, tanh, to capture complex inter-modal relationships. The attention score for the acoustic modality is defined as follows:

s_{a} = w^{T} \tanh(W_{a} h_{a} + b_{a})

(7)

where h_{a} \in \mathbb{R}^{d} represents the feature vector extracted from the acoustic signal, which is the output from the feature extractor. W_{a} \in \mathbb{R}^{k \times d} is a learnable weight matrix that maps the d-dimensional acoustic features to a hidden space of dimension k (here k denotes the hidden dimension of the attention layer, not the kernel size in Equation (3)). b_{a} \in \mathbb{R}^{k} is the bias term, and w \in \mathbb{R}^{k} is the learnable attention vector, which transforms the activated features into a scalar attention score.

Similarly, the attention score for the vibration modality is defined as:

s_{v} = w^{T} \tanh(W_{r} h_{v} + b_{r})

(8)

where h_{v} \in \mathbb{R}^{d} represents the feature vector extracted from the vibration signal, W_{r} \in \mathbb{R}^{k \times d} and b_{r} \in \mathbb{R}^{k} are the learnable weight matrix and bias term for the vibration modality, and w \in \mathbb{R}^{k} is the attention vector shared with the acoustic modality. After obtaining the attention scores for both modalities, softmax normalization is applied to compute the attention weights α_{a} and α_{v}, representing the relative importance of each modality during the fusion process. The attention weights for acoustic and vibration modalities are defined as follows:

α_{a} = \frac{\exp(s_{a})}{\exp(s_{a}) + \exp(s_{v})}

(9)

α_{v} = \frac{\exp(s_{v})}{\exp(s_{a}) + \exp(s_{v})}

(10)

From Equations (9) and (10), it can be observed that the attention weights of the two modalities sum to 1, so the modalities are fused as a convex combination.

After computing the attention weights for the acoustic and vibration modalities, the final fused feature vector is obtained by combining the two modality features in a weighted sum:

h_{fused} = α_{a} \cdot h_{a} + α_{v} \cdot h_{v}

(11)

where h_{fused} \in \mathbb{R}^{d} is the fused feature vector, and α_{a} and α_{v} are the attention weights applied to the acoustic and vibration feature vectors, respectively. The fused feature vector h_{fused} combines information from both modalities, reflecting the varying importance of each modality’s features, thus providing a rich representation for subsequent classification tasks.
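Equations (7)–(11) can be traced numerically with the following minimal sketch. All parameter values are arbitrary toy numbers chosen for illustration (d = 3, k = 2), and the helper names `matvec`, `attention_score`, and `fuse` are hypothetical; in the actual model the parameters are learned.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def attention_score(h, W, b, w):
    """Additive attention score s = w^T tanh(W h + b), as in Equations (7)-(8)."""
    hidden = [math.tanh(z + b_i) for z, b_i in zip(matvec(W, h), b)]
    return sum(w_i * u_i for w_i, u_i in zip(w, hidden))


def fuse(h_a, h_v, W_a, b_a, W_r, b_r, w):
    """Softmax over the two scores (Eqs. 9-10), then weighted sum (Eq. 11)."""
    s_a = attention_score(h_a, W_a, b_a, w)
    s_v = attention_score(h_v, W_r, b_r, w)
    m = max(s_a, s_v)  # subtracting the maximum leaves the softmax unchanged
    e_a, e_v = math.exp(s_a - m), math.exp(s_v - m)
    alpha_a, alpha_v = e_a / (e_a + e_v), e_v / (e_a + e_v)
    h_fused = [alpha_a * a + alpha_v * v for a, v in zip(h_a, h_v)]
    return h_fused, alpha_a, alpha_v


# Toy feature vectors and parameters (illustrative values only).
h_a = [1.0, 0.0, -1.0]
h_v = [0.5, 0.5, 0.5]
W_a = [[0.2, -0.1, 0.0], [0.0, 0.3, 0.1]]
W_r = [[0.1, 0.1, 0.1], [-0.2, 0.0, 0.2]]
b_a = [0.0, 0.0]
b_r = [0.0, 0.0]
w = [1.0, -1.0]

h_fused, alpha_a, alpha_v = fuse(h_a, h_v, W_a, b_a, W_r, b_r, w)
print(alpha_a, alpha_v, h_fused)
```

Because the two weights sum to 1, every entry of the fused vector lies between the corresponding acoustic and vibration entries.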

In this work, a lightweight additive attention formulation is adopted for modality weighting. Compared with dot-product attention, this design is more suitable for the present two-modality fusion scenario because the acoustic and vibration features are heterogeneous in sensing characteristics and may not be directly comparable through inner-product similarity alone. The nonlinear projection with tanh allows each modality to be mapped into a shared latent space before scoring, while the learnable vector provides a simple and stable mechanism for estimating modality importance with low computational overhead. Since the purpose here is to assign a global fusion weight to each modality rather than to model complex token-wise interactions, the adopted formulation is sufficient and computationally efficient.

2.3. Graph Construction

To capture structural relationships among samples, a k-nearest-neighbor (kNN) graph is constructed in the fused multimodal feature space, where edges connect samples with high representational similarity. This graph-based structure enables the subsequent GCN to exploit inter-sample relational information beyond individual feature extraction. To ensure the graph remains consistent with evolving representations during training, the kNN graph is dynamically reconstructed at each epoch based on the latest feature embeddings, rather than being fixed at initialization. This dynamic update strategy prevents the adjacency structure from becoming misaligned with the learned feature space, thereby improving the effectiveness of graph-based feature propagation.

The feature matrix H, composed of the fused feature vectors h_{fused}^{(i)}, is represented as follows:

H = [h_{fused}^{(1)}, h_{fused}^{(2)}, \dots, h_{fused}^{(N)}]^{T} \in \mathbb{R}^{N \times d}

(12)

where N denotes the total number of samples, and d represents the feature dimension for each sample. To construct the kNN graph, the Euclidean distance is used to measure the similarity between samples, thereby determining the connections between nodes. For each node, the k nearest neighbors are selected, and an adjacency matrix is constructed based on the similarity between samples. Each edge recorded in the adjacency matrix represents the similarity between the connected samples.
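The graph construction described above can be sketched as follows. Several details are assumptions, since the text does not fix them: edges are stored as binary entries rather than similarity weights, the adjacency matrix is symmetrized, and self-loops are excluded. The function name `knn_adjacency` is hypothetical.

```python
import math


def knn_adjacency(H, k):
    """Build a symmetric binary kNN adjacency matrix from the rows of H.

    Neighbors are ranked by Euclidean distance; each node is linked to its
    k nearest neighbors, and the link is mirrored so that A stays symmetric
    (an assumption made for this sketch).
    """
    n = len(H)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        ranked = sorted((math.dist(H[i], H[j]), j) for j in range(n) if j != i)
        for _, j in ranked[:k]:
            A[i][j] = 1
            A[j][i] = 1
    return A


# Four toy fused feature vectors forming two tight pairs.
H = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
A = knn_adjacency(H, k=1)
for row in A:
    print(row)
```

In a dynamic scheme such as the one described in this section, a function of this kind would be called again on the latest embeddings at each epoch.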

After graph construction, the fused multimodal features are processed by a graph convolutional network (GCN). The graph propagation rule adopted in this work follows the classical symmetrically normalized GCN formulation proposed by Kipf and Welling [38].

For each node i, a linear transformation is applied to project the original feature vector into a new feature space:

h'_{i} = W h_{i}

(13)

where h_{i} \in \mathbb{R}^{d} is the original feature vector of node i, W \in \mathbb{R}^{d' \times d} is a learnable weight matrix, and h'_{i} \in \mathbb{R}^{d'} is the transformed feature vector. For node i, the aggregated feature representation h_{i}^{out} is defined as:

h_{i}^{out} = \sigma\left(\sum_{j \in N_{i}} \frac{1}{\sqrt{\deg(i)} \sqrt{\deg(j)}} W h_{j}\right)

(14)

where N_{i} denotes the set of neighboring nodes of node i, and deg(i) and deg(j) represent the degrees of nodes i and j, respectively. W is the weight matrix shared across the layer, and \sigma(\cdot) is a nonlinear activation function.

The normalized adjacency matrix is defined as follows:

\tilde{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}

(15)

where A is the adjacency matrix of the graph and D is the degree matrix, with D_{ii} = \deg(i).
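A single propagation step following Equations (13)–(15) can be sketched as below. The sketch follows Equation (15) as written, so no self-loops are added (the original Kipf and Welling formulation adds them before normalization). The ReLU activation, the toy values, and the function name `gcn_layer` are assumptions; the adjacency matrix is assumed symmetric so that every neighbor has a nonzero degree.

```python
import math


def gcn_layer(A, H, W, activation=lambda v: max(0.0, v)):
    """Symmetrically normalized graph convolution, Equations (13)-(15).

    A: symmetric adjacency matrix (n x n), H: node features (n x d),
    W: weight matrix (d' x d). Returns the n x d' output features.
    """
    n = len(A)
    deg = [sum(row) for row in A]
    # Equation (13): h'_j = W h_j for every node j.
    HW = [[sum(w * x for w, x in zip(row, h)) for row in W] for h in H]
    out = []
    for i in range(n):
        agg = [0.0] * len(W)
        for j in range(n):
            if A[i][j]:
                # Entry (i, j) of D^{-1/2} A D^{-1/2}.
                coef = A[i][j] / math.sqrt(deg[i] * deg[j])
                agg = [a + coef * v for a, v in zip(agg, HW[j])]
        out.append([activation(a) for a in agg])
    return out


# Toy star graph: node 0 is connected to nodes 1 and 2.
A = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
H = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only the aggregation is visible
out = gcn_layer(A, H, W)
print(out)
```

With the identity weight matrix, node 0 receives the features of nodes 1 and 2, each scaled by 1/√2, while nodes 1 and 2 each receive the feature of node 0 scaled by the same factor.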

After aggregating information from neighboring nodes, the output layer of the GCN uses a classifier to predict the class of each node. The classification expression is as follows:

z_{i} = \mathrm{softmax}(W_{class} h_{i}^{out} + b_{class})

(16)

where W_{class} \in \mathbb{R}^{C \times d'} is the weight matrix of the classifier, b_{class} \in \mathbb{R}^{C} is the bias term of the classifier, and C is the number of classes. The construction and iteration process of the GCN is shown in Figure 1.

2.4. Multi-Task Learning

In the multi-task learning framework, the model is required to simultaneously learn both the primary and auxiliary tasks. The primary task is fault classification, which predicts the class label for each sample. The auxiliary task is modality classification, which is applied to the modality-specific feature vectors h_a and h_v before fusion. The purpose of this auxiliary task is to preserve modality-discriminative information during representation learning, thereby improving the quality of multimodal fusion and enhancing generalization.

\begin{matrix} p_{a} = \mathrm{softmax}(W_{mod} h_{a} + b_{mod}) \\ p_{v} = \mathrm{softmax}(W_{mod} h_{v} + b_{mod}) \end{matrix}

(17)

In this equation, W_{mod} \in \mathbb{R}^{2 \times d} is the weight matrix of the modality classifier, which projects each modality-specific feature vector onto a two-dimensional probability space representing the acoustic and vibration modalities, and b_{mod} \in \mathbb{R}^{2} is the bias term of the modality classifier. To train the modality classifier, the true labels y_{modality} for each modality are defined with a one-hot encoding as follows:

For acoustic features:

y_{modality} = [1, 0]

(18)

For vibration features:

y_{modality} = [0, 1]

(19)

Through the modality classification task, the model can better distinguish features originating from different modalities, and this additional supervision helps to improve the quality of the fused features.

2.5. Loss Function Construction

In the multi-task learning process, the model’s loss function consists of the primary classification task loss and the auxiliary modality classification task loss. To enable the model to achieve optimal performance on both tasks, the final loss is a weighted sum of the primary and auxiliary task losses.

The loss function for the primary classification task employs cross-entropy loss, which measures the difference between the predicted and true class labels. The classification task loss is defined as:

L_{classification} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{class,i}^{(c)} \log z_{i}^{(c)}

(20)

where N is the number of samples, C is the number of classes, y_{class,i}^{(c)} represents the true label of sample i for class c, and z_{i}^{(c)} is the predicted probability that sample i belongs to class c. Cross-entropy loss measures the divergence between the true label and the predicted probability distribution. If the model’s predicted probability closely matches the true label, the loss will be small; otherwise, it will be larger.

The auxiliary modality classification loss measures how accurately the model identifies the source modality of each feature vector, and is defined as follows:

L_{modality} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{m=1}^{2} y_{modality,i}^{(m)} \log p_{modality,i}^{(m)}

(21)

where y_{modality,i}^{(m)} is the m-th component of the one-hot modality label for sample i. For acoustic features, the label is [1, 0]; for vibration features, the label is [0, 1]. p_{modality,i}^{(m)} is the corresponding predicted probability. The modality classification loss encourages the model to correctly distinguish the source modality of each feature vector.

The total loss is a weighted sum of the classification and modality classification losses, and is expressed as:

L_{total} = λ_{classification} L_{classification} + λ_{modality} L_{modality}

(22)

where λ_classification and λ_modality are weight parameters. These weights adjust the importance of the primary and auxiliary tasks in the total loss function. Through this approach, the model can jointly optimize the primary classification task and the auxiliary modality classification task, allowing for flexible adjustment of task weights according to specific application scenarios and enhancing the model’s performance. Accordingly, the auxiliary loss L_modality is used to regularize the shared representation learning process, while L_classification remains the primary optimization target for fault diagnosis.
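The loss construction of Equations (20)–(22) can be sketched as follows. Because the labels are one-hot, the inner sum over classes reduces to the negative log-probability of the true class. The weight values (1.0 and 0.1) are placeholders chosen for illustration, not values reported in the paper. Averaging the acoustic and vibration terms follows step 20 of Algorithm 1, and the function names are hypothetical.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy over samples.

    probs: list of probability vectors; labels: list of true class indices.
    With one-hot labels the inner sum keeps only -log(p[true class]).
    """
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)


def total_loss(fault_probs, fault_labels, p_a, p_v, lam_cls=1.0, lam_mod=0.1):
    """Weighted sum of the fault loss and the auxiliary modality loss (Eq. 22).

    lam_cls and lam_mod are placeholder weights for this sketch.
    """
    l_cls = cross_entropy(fault_probs, fault_labels)
    # Acoustic features carry label [1, 0] (index 0); vibration [0, 1] (index 1).
    l_mod = 0.5 * (cross_entropy(p_a, [0] * len(p_a))
                   + cross_entropy(p_v, [1] * len(p_v)))
    return lam_cls * l_cls + lam_mod * l_mod, l_cls, l_mod


# Toy predictions for two samples and three fault classes.
fault_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
fault_labels = [0, 1]
p_a = [[0.9, 0.1], [0.8, 0.2]]  # modality classifier outputs on acoustic features
p_v = [[0.2, 0.8], [0.3, 0.7]]  # modality classifier outputs on vibration features

loss, l_cls, l_mod = total_loss(fault_probs, fault_labels, p_a, p_v)
print(loss, l_cls, l_mod)
```

Setting `lam_mod` to zero recovers plain single-task training, which makes the contribution of the auxiliary term easy to isolate in an ablation.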

The Adam optimizer is ultimately used to optimize all parameters of the model. Algorithm 1 provides the implementation process for fault diagnosis in the MTMAGNet data fusion model.

Algorithm 1: MTMAGNet Fusion Algorithm for Intelligent Fault Diagnosis

Require: Acoustic signal dataset X_a, Vibration signal dataset X_v

Ensure: Classification results of the fused signals

1: for each data file i = 1 to N do
2:   Read the acoustic signal X_{a}^{(i)} and the vibration signal X_{v}^{(i)}
3:   Obtain the corresponding class label y^{(i)}
4: end for
5: Perform normalization on X_{a} and X_{v} to obtain the normalized signals \tilde{X}_{a} and \tilde{X}_{v}
6: Define the FeatureExtractor f_{FE}
7: Define the ModalityAttention
8: Define the GCN
9: Define the modality classifier MultiTaskModel
10: Initialize the model parameters and the optimizer
11: for epoch from 1 to num_epochs do
12:   for each training batch do
13:     Define modality labels: acoustic modality label y_{a} = 0, vibration modality label y_{v} = 1
14:     Input \tilde{X}_{a} and \tilde{X}_{v} to the FeatureExtractor to obtain feature vectors: h_{a} = f_{FE}^{(a)}(\tilde{X}_{a}), h_{v} = f_{FE}^{(v)}(\tilde{X}_{v})
15:     Compute attention scores using ModalityAttention: s_{a} = w^{T} \tanh(W_{a} h_{a} + b_{a}), s_{v} = w^{T} \tanh(W_{r} h_{v} + b_{r})
        Compute attention weights via softmax: α_{a} = \frac{\exp(s_{a})}{\exp(s_{a}) + \exp(s_{v})}, α_{v} = \frac{\exp(s_{v})}{\exp(s_{a}) + \exp(s_{v})}
16:     Fuse features to obtain the fused feature vector: h_{fused} = α_{a} \cdot h_{a} + α_{v} \cdot h_{v}
17:     Construct a kNN graph based on h_{fused} to obtain the adjacency matrix or edge_index
18:     Input h_{fused} and edge_index into the GCN to obtain classification logits: logits = GCN(h_{fused}, edge_index)
19:     Compute modality logits for the acoustic and vibration modalities using the modality classifier: modality_logits_{a} = ModalityClassifier(h_{a}), modality_logits_{v} = ModalityClassifier(h_{v})
20:     Compute the main task classification loss and the auxiliary modality classification loss:
        Main task loss: L_{classification} = CrossEntropyLoss(logits, y)
        Modality classification loss: L_{modality} = \frac{1}{2} (CrossEntropyLoss(modality_logits_{a}, y_{a}) + CrossEntropyLoss(modality_logits_{v}, y_{v}))
21:     Compute the total loss: L_{total} = λ_{classification} L_{classification} + λ_{modality} L_{modality}
22:     Perform backpropagation and update model parameters using the optimizer
23:   end for
24:   Update the learning rate scheduler
25: end for
26: for each testing batch do
27:   Repeat steps 13 to 20 from the training phase without performing backpropagation
28: end for

2.6. MTMAGNet Intelligent Diagnosis Framework

The proposed MTMAGNet method is an intelligent fault diagnosis framework that extracts signal features using convolutional networks, fuses these features via the multimodal attention (MMA) mechanism, models sample relationships with a GCN, and optimizes classification tasks through multi-task learning (MTL). Figure 2 provides an overview of the framework, which consists of the following four key steps:

Step 1: Acquire acoustic and vibration signals from rolling bearings of mechanical equipment operating under various conditions. These signals serve as the primary input for fault diagnosis, capturing crucial information about the equipment’s operational state.

Step 2: Use CNNs to extract deep features from the acquired acoustic and vibration signals. Then, apply the MMA mechanism to fuse these features. Attention weights are computed to weight the contribution of each modality, resulting in fused feature representations.

Step 3: Construct a graph representation of the data using the fused feature vectors, where nodes represent the samples and edges are formed based on their similarity. The graph structure is created using the kNN algorithm. The fused features and graph are then passed to the GCN, which captures both local and global relationships among the samples, yielding enhanced feature representations.

Step 4: The GCN outputs are fed into the MTL module, which includes both the main fault classification task and the auxiliary modality classification task. A total loss function is defined, and the model is trained by minimizing this loss, enabling accurate fault diagnosis based on the fused multimodal signals.

3. Experiment

To verify the effectiveness of the proposed algorithm, we set up a rolling bearing acoustic and vibration test bench, as shown in Figure 3. The experimental system mainly comprises a motor, controller, shaft, test bearing, acoustic array sensors, and data acquisition system. The acoustic subsystem is a 16-channel microphone array consisting of 16 BSWA MPA416 acoustic sensors, each with a sensitivity of 50 mV/Pa. During data acquisition, the 16 microphone channels are synchronously sampled and stored as multichannel acoustic recordings. For each sample, temporally aligned 1024-point segments are extracted simultaneously from all 16 channels and stacked to form a 16-channel acoustic input tensor.

These sensors are arranged in a circular array mounted on a disk, forming an acoustic array board. To ensure signal quality and reduce signal attenuation, the acoustic array board is installed on a plane 200 mm away from the end of the bearing under test, with its center aligned with the bearing axis. During the operation of the bearing, the acoustic array sensors collect acoustic signals and transmit them to a computer terminal via a PAK MKII-SC42 data acquisition system, which converts the analog signals into digital signals. To meet the requirements of the sampling theorem, the sampling frequency of the acoustic signals is set to 16,384 Hz. A vibration sensor is installed on the bearing housing of the bearing under test to collect vibration signals. These signals are collected using a BSZ800D-16 data acquisition device, with the sampling frequency also set to 16,384 Hz. The vibration data acquisition device is connected to the computer terminal, and the vibration signals are analyzed alongside the acoustic signals in subsequent processing. This experiment investigates faults of the outer race, inner race, and rolling elements of rolling bearings under a fixed rotational speed of 2400 rpm. The geometric parameters of the bearing are shown in Table 1.
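As a worked check of the acquisition settings, the short calculation below derives the Nyquist frequency, the shaft rotation frequency, and the number of shaft revolutions covered by one 1024-point window. The variable names are illustrative; the numerical inputs are taken from the text.

```python
FS_HZ = 16_384      # sampling frequency of both modalities
WINDOW = 1024       # points per sample window
SPEED_RPM = 2400    # fixed rotational speed of this experiment

nyquist_hz = FS_HZ / 2                    # highest representable frequency
shaft_hz = SPEED_RPM / 60                 # shaft rotation frequency
window_s = WINDOW / FS_HZ                 # duration of one window in seconds
revs_per_window = shaft_hz * window_s     # shaft revolutions per window
```

Under these settings the Nyquist frequency is 8192 Hz, the shaft rotates at 40 Hz, and each 62.5 ms window spans 2.5 shaft revolutions.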

Figure 4 presents a set of time-domain signals of vibration and acoustical data collected from the test bench for seven different fault types. Each subfigure separately shows the amplitude variations in vibration and acoustical recordings over time under specific fault conditions. In the left column of each figure, the vibration signals exhibit a wide range of amplitude fluctuations across various fault conditions, indicating potential differences in energy distribution and fault impact intensity. The right column displays the acoustical signals, whose amplitude variations are consistent with the fault characteristics depicted by the vibration signals. These signals provide foundational insights into the characteristics of each fault type, which will be utilized in the feature extraction and classification stages of the proposed multi-modal diagnostic framework.

To construct the bearing fault dataset, experiments were conducted at a fixed working speed of 2400 rpm. We constructed seven operating conditions for the rolling bearing, covering normal operation and fault conditions such as outer race faults, inner race faults, rolling element faults, and some coupled faults. Each condition was assigned a label from C0 to C6 to represent a different bearing state. For both acoustic and vibration signals, 400 samples were collected for each state, totaling 2800 samples per type of signal, with 20% used for validation. Detailed information is listed in Table 2.

The proposed model is implemented using Python 3.10 and PyTorch 1.11.0 (GPU build), running on an NVIDIA GeForce RTX 3070 GPU paired with an 11th Gen Intel(R) Core(TM) i7-11700K processor at 3.60 GHz and 48 GB of RAM. The network architecture and parameter configuration are detailed in Table 3, which describes the structural components of each layer in the model. The FeatureExtractor module consists of three 1D convolutional layers with kernel sizes of 5, 5, and 3, input sizes of 1, 64, and 128, and output sizes of 64, 128, and 256, respectively. A dropout layer with a probability of 0.5 is incorporated after the convolutional layers to mitigate overfitting, followed by a fully connected layer that outputs a feature size of 128. The ModalityAttention module includes three linear layers with an input size of 128 for the first two layers and 64 for the third, transforming the features to facilitate multi-modality fusion. In the GraphLayer, two GCNConv layers are employed with output sizes of 64 and 7, respectively, to extract graph-based features. The final ModalClassifier comprises a fully connected layer with an output size of 2 to distinguish between different modalities.

The hyper-parameter settings for the model are summarized in Table 4. The batch size for both training and testing is set to 64, and optimization is handled by the Adam optimizer with a learning rate of 0.0001 and a weight decay of 1 × 10⁻⁴. A step learning-rate scheduler is adopted for learning-rate decay, with step_size = 50 and gamma = 0.5. The number of epochs is fixed at 100. Additionally, the loss weights are set to λ_class = 1.0 and λ_modality = 0.4 to balance the loss functions of the classification and modality tasks. The window size for feature extraction is set to 1024, and the number of nearest neighbors (n_neighbors) is 5, with each file containing 400 samples.
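The effect of these settings can be sketched in a few lines. The step-decay formula assumes the scheduler is advanced once per epoch, and the weighted-sum form of the total loss is an assumption, since this section gives only the two weights. The function names `step_lr` and `total_loss` are illustrative.

```python
def step_lr(epoch, base_lr=1e-4, step_size=50, gamma=0.5):
    """Learning rate at a given epoch under step decay (one step per epoch)."""
    return base_lr * gamma ** (epoch // step_size)


def total_loss(loss_class, loss_modality, lambda_class=1.0, lambda_modality=0.4):
    """Assumed weighted sum of the main-task and auxiliary-task losses."""
    return lambda_class * loss_class + lambda_modality * loss_modality


# Learning rate at the first and last epoch of each 50-epoch stage.
schedule = [step_lr(e) for e in (0, 49, 50, 99)]
```

With 100 epochs and step_size = 50, the learning rate is halved exactly once, so the second half of training runs at 5 × 10⁻⁵.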

To avoid information leakage between training, validation, and testing, the data split was performed before graph construction, and each split was processed independently throughout the subsequent pipeline. Specifically, acoustic and vibration recordings belonging to different operating conditions were first assigned to the training, validation, and testing subsets. Signal segmentation was then conducted separately within each subset, and no segmented sample derived from one original recording was allowed to appear in more than one subset. In addition, the k-nearest-neighbor graph was constructed independently for each subset based only on the samples within that subset. Therefore, no test sample participated in the graph construction, feature propagation, or hyperparameter selection during training. All normalization operations were also performed separately after the data split, so that no statistical information from the validation or test sets was used in model training.
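The split-before-segmentation rule can be sketched as follows. The split fractions, the random seed, and the function name `split_recordings` are hypothetical; the point illustrated is only that whole recordings are assigned to subsets before any windowing, so that no recording contributes samples to more than one subset.

```python
import random


def split_recordings(recording_ids, val_frac=0.2, test_frac=0.2, seed=0):
    """Assign whole recordings to train/val/test before any segmentation.

    The fractions and seed are hypothetical placeholders.
    """
    ids = sorted(recording_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    return {
        "val": ids[:n_val],
        "test": ids[n_val:n_val + n_test],
        "train": ids[n_val + n_test:],
    }


# Hypothetical recording identifiers for one operating condition.
splits = split_recordings([f"rec{i}" for i in range(10)])
```

Segmentation, normalization, and KNN graph construction would then be applied to each of the three subsets separately.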

The number of neighbors k was selected based on validation performance. We evaluated k ∈ {3, 5, 7, 9, 11} and found that k = 5 achieved the best trade-off between local structural preservation and classification stability. Too small a k led to insufficient neighborhood information, whereas too large a k introduced redundant inter-class connections. As shown in Table 5, the model achieved the best validation/test performance at k = 5; therefore, this value was adopted in all subsequent experiments.

4. Diagnostic Analysis

4.1. Experiment Result

To evaluate the impact of each component within the proposed MTMAGNet model on overall performance, we conducted ablation experiments by individually removing or simplifying key modules and observed the changes in test accuracy over training iterations. The methods for each ablated module are described as follows: (a) to verify the effectiveness of the multimodal information fusion method, we first assessed the fault diagnosis performance using single-modal data, employing only the combination of vibration signals and the GCN model, denoted as Only Vibration with GCN (OV-GCN); (b) using only the combination of acoustic signals and the GCN model, denoted as Only Acoustic with GCN (OA-GCN); (c) fusing acoustic and vibration signals but removing the modal attention mechanism while retaining the multi-task learning module, denoted as MTL-GCN; (d) fusing acoustic and vibration signals but removing the multi-task learning module while retaining the modal attention mechanism, denoted as MMA-GCN; and (e) fusing acoustic and vibration signals but simultaneously removing both the modal attention mechanism and the multi-task learning module, retaining only the GCN model, referred to as VA-GCN. The results of the ablation experiments are shown in Figure 5.

Figure 5 shows that the complete MTMAGNet model maintained a test accuracy close to 100% throughout the training process, outperforming all other configurations. The single-modal models OV-GCN and OA-GCN exhibited lower performance, especially OA-GCN, which showed significant fluctuations in accuracy and was notably inferior to the other models. The MMA-GCN and MTL-GCN models, which retained either the modal attention mechanism or the multi-task learning module, showed significant improvements in test accuracy, reaching approximately 95%. This indicates that each module plays a crucial role in multimodal data processing and feature fusion. The VA-GCN model, which simultaneously removed both modules, performed worse than the complete model, confirming the necessity of the modal attention mechanism and multi-task learning in enhancing classification accuracy. The ablation results demonstrate the contribution of each module; the synergistic effect of the multimodal attention mechanism and multi-task learning enables the MTMAGNet model to exhibit outstanding performance in multimodal fault diagnosis tasks.

To further verify the effectiveness of the MTMAGNet model under multi-modal fusion, Table 6 summarizes the test results of the ablation experiments over eight trials. These results report the performance of the different models in terms of maximum test accuracy (Max-acc), minimum test accuracy (Min-acc), average test accuracy (Average-acc), and the standard error of the average accuracy. These data provide quantitative evidence for assessing the impact of each model component on overall performance. The complete MTMAGNet model performed exceptionally well across all experiments, achieving a maximum test accuracy of 100%, a minimum of 99.11%, an average accuracy of 99.78%, and a standard error as low as 0.31%. This demonstrates that the collaborative effect of multi-modal fusion, the modal attention mechanism, and multi-task learning enables the model to attain extremely high classification accuracy and stability. In contrast, the MMA-GCN model, which lacks the multi-task learning module, saw its average accuracy decrease to 96.55%, with the standard error increasing to 0.41%. This indicates that multi-task learning plays a significant role in enhancing feature extraction efficiency and the model’s generalization ability. The MTL-GCN model, which lacks the modal attention mechanism, achieved an average accuracy of 94.90% with a standard error of 0.79%, highlighting the crucial role of the modal attention mechanism in effectively fusing multi-modal features. The VA-GCN model, retaining only the GCN component, had an average accuracy of 94.35% and a standard error of 0.81%, significantly lower than the complete model. This underscores the necessity of both the modal attention mechanism and multi-task learning. The single-modal models, OV-GCN and OA-GCN, achieved average accuracies of 91.81% and 86.16%, with standard errors of 0.94% and 1.39%, respectively, performing the worst among all models. This further confirms that single-modal data cannot fully exploit the model’s potential, emphasizing the importance of multi-modal fusion.

The ablation results also show that introducing the auxiliary modality-classification task improves rather than degrades the performance of the full model. In particular, the complete MTMAGNet achieves higher average accuracy than the variant without multi-task learning, indicating that the auxiliary loss provides effective regularization rather than harmful negative transfer under the present setting.
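The summary statistics reported for each configuration can be computed from per-trial accuracies as sketched below. The text does not give the formula behind the reported standard error, so the sketch returns both the sample standard deviation and the standard error of the mean; the accuracy values are hypothetical and are not taken from Table 6.

```python
import math
import statistics


def summarize(accuracies):
    """Max, min, mean, sample standard deviation, and standard error of the mean."""
    n = len(accuracies)
    std = statistics.stdev(accuracies)  # sample standard deviation (n - 1)
    return {
        "max": max(accuracies),
        "min": min(accuracies),
        "mean": statistics.mean(accuracies),
        "std": std,
        "sem": std / math.sqrt(n),
    }


# Hypothetical per-trial test accuracies in percent, for illustration only.
summary = summarize([99.0, 100.0, 99.0, 100.0])
```

Reporting which of the two dispersion measures is used matters when comparing configurations, because the standard error shrinks with the number of trials while the standard deviation does not.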

4.2. Model Performance Analysis

To comprehensively evaluate the sensitivity of the MTMAGNet model to hyperparameter configurations, Figure 6 depicts the relationship between diagnostic accuracy and the hyperparameters Lambda Modality and Lambda Class in a three-dimensional surface plot. In this visualization, the X-axis and Y-axis correspond to Lambda Modality and Lambda Class values, respectively, while the Z-axis represents the average diagnostic accuracy achieved by the model. From the figure, it is evident that the model reaches peak diagnostic accuracy within the moderate hyperparameter range of approximately 0.4 to 0.6 for both Lambda Modality and Lambda Class. This region forms a plateau, indicating stable and consistently high model performance. This behavior suggests that appropriately balanced task weights allow the model to exploit the complementary information from both tasks effectively. Conversely, when Lambda Modality and Lambda Class are set toward extreme values, a clear decrease in accuracy accompanied by greater fluctuation emerges. Particularly noticeable is the sharp performance drop when both hyperparameters approach higher values, suggesting that excessive emphasis on either the modality-classification task or the fault-classification task can negatively impact the model’s generalization ability. This phenomenon likely results from an imbalance that induces overfitting to one particular task, limiting the model’s capacity to harness multi-task synergies.

To further substantiate the selection of the loss weights, a systematic grid search was conducted over Lambda Modality and Lambda Class. Specifically, Lambda Class was varied in {0.2, 0.4, 0.6, 0.8, 1.0}, and Lambda Modality was varied in {0.1, 0.2, 0.4, 0.6, 0.8}. For each parameter pair, the model was trained and evaluated under the same experimental protocol, and the corresponding average accuracy and Macro-F1 were recorded. The results are summarized in Table 7. It can be observed that the performance varies systematically with the two loss-weight parameters, and the combination Lambda Modality = 0.4 and Lambda Class = 1.0 achieves the best overall trade-off among the tested settings. This result is consistent with the trend shown in Figure 6, where the model exhibits the most stable and highest performance in the moderate parameter region. Therefore, these values were adopted in all subsequent experiments.
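The grid search over the two loss weights can be sketched as follows. The `evaluate` callback stands for one full train-and-validate run and is hypothetical, as is the `toy_evaluate` stand-in, which is built to peak at a chosen point so that the search logic can be exercised; it does not reproduce any result from Table 7.

```python
from itertools import product

LAMBDA_CLASS = [0.2, 0.4, 0.6, 0.8, 1.0]
LAMBDA_MODALITY = [0.1, 0.2, 0.4, 0.6, 0.8]


def grid_search(evaluate):
    """Score every (lambda_class, lambda_modality) pair and return the best.

    evaluate: callable mapping a weight pair to a validation score.
    """
    grid = list(product(LAMBDA_CLASS, LAMBDA_MODALITY))
    scored = [(evaluate(lc, lm), (lc, lm)) for lc, lm in grid]
    best_score, best_pair = max(scored)
    return best_pair, best_score, len(grid)


def toy_evaluate(lambda_class, lambda_modality):
    """Hypothetical score surface, NOT measured data: peaks at (1.0, 0.4)."""
    return -((lambda_class - 1.0) ** 2 + (lambda_modality - 0.4) ** 2)


best_pair, best_score, n_pairs = grid_search(toy_evaluate)
```

The two value sets yield 25 weight pairs, so the search requires 25 complete training runs under the stated protocol.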

In order to further analyze the model’s attention allocation to acoustic and vibration features across different fault categories, the mean attention weights extracted by the MTMAGNet model on the test set are depicted in Figure 7. In this figure, the horizontal axis represents different fault categories (Class 0 to Class 6), and the vertical axis denotes the mean attention weight. The yellow bars represent the mean attention weights allocated to acoustic signals, while the orange bars illustrate those for vibration signals, with the error bars indicating standard deviations. The results clearly reveal varying patterns of attention allocation between the acoustic and vibration modalities across different fault categories. For instance, in Class 0 and Class 2, attention weights for acoustic and vibration signals are relatively balanced, indicating that the model equally considers both modalities when extracting discriminative features. In contrast, for Class 1, Class 3, Class 5, and Class 6, attention weights assigned to vibration signals are significantly higher than those assigned to acoustic signals, suggesting a stronger reliance on vibration-based features by the model in these fault categories. Specifically, Class 3 exhibits the greatest disparity, highlighting vibration signals as particularly informative for this fault type. For Class 4, although the attention allocated to vibration signals remains predominant, attention given to acoustic signals is relatively elevated, reflecting a more nuanced reliance on both modal features.

Overall, these findings confirm that the employed attention mechanism effectively guides the MTMAGNet model to adaptively prioritize modality-specific information depending on fault characteristics. By dynamically adjusting attention weights, the model achieves optimized multimodal feature integration, thereby significantly enhancing its fault classification performance. This adaptive capability underscores the effectiveness and robustness of our proposed multimodal fusion approach in fault diagnosis tasks.

To assess the classification accuracy of the MTMAGNet model under varying levels of noise, noise with different standard deviations was added to the original acoustic and vibration signals; the results are shown in Figure 8. The X-axis represents the noise standard deviation, while the Y-axis shows the corresponding classification accuracy of the model. The MTMAGNet model maintains a high accuracy, close to 100%, when the noise standard deviation is low (0 to 0.5). As the noise intensity increases, the accuracy begins to decline, dropping gradually as the noise standard deviation approaches 1.0. Beyond this threshold, the model’s performance degrades more sharply, with accuracy falling below 90% when the noise standard deviation exceeds 2.0. These results indicate that MTMAGNet exhibits strong robustness under low-to-moderate noise levels, whereas its performance degrades noticeably under severe noise. Therefore, the robustness of the proposed method should be understood as conditional rather than uniform across all noise intensities.
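The noise injection can be sketched as below. Zero-mean additive Gaussian noise is an assumption, since the text specifies only the standard deviation of the noise; the function name `add_gaussian_noise` and the toy signal are illustrative.

```python
import random


def add_gaussian_noise(signal, noise_std, seed=None):
    """Return a copy of the signal with zero-mean Gaussian noise added.

    The Gaussian form is an assumption; the source states only the
    standard deviation of the injected noise.
    """
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, noise_std) for x in signal]


clean = [0.0, 1.0, 0.0, -1.0] * 4
noisy_low = add_gaussian_noise(clean, noise_std=0.5, seed=1)
noisy_high = add_gaussian_noise(clean, noise_std=2.0, seed=1)
unchanged = add_gaussian_noise(clean, noise_std=0.0, seed=1)
```

Because the noise level is expressed as an absolute standard deviation, its severity depends on the amplitude scale of the signals, so the same value has a different effect on normalized and unnormalized inputs.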

Figure 9 visualizes the feature embedding space of the graph convolutional network (GCN) at different training stages of the proposed model. At the first epoch, the node features are relatively dispersed, with nodes from different categories intermingling and no clear class boundaries. As training progresses, nodes gradually form clustered distributions, category boundaries become more distinct, and the relationships between nodes strengthen. At epochs 5 and 10, some categories begin to separate and form initial clusters, indicating that the model has started to identify and differentiate feature patterns. By epoch 15, the distinctions between categories become more pronounced, and the clustering of nodes improves markedly, indicating that the model has effectively partitioned the feature space. By epoch 20, nodes from different categories form distinct and dense clusters with clear boundaries, demonstrating that the model has achieved strong classification capabilities in the embedding space.

4.3. Comparison with Other Existing Methods

To ensure a fair comparison, all comparative models were trained and evaluated under the same experimental protocol. Specifically, the same train/validation/test splits, signal segmentation strategy, normalization procedure, batch size, optimizer type, training epochs, and evaluation metrics were used for all models. For the deep learning baselines, hyperparameters were selected under the same validation protocol, and no model had access to test-set information during training or model selection. Moreover, for the compared multimodal deep-learning baselines, the same or equivalent 1D-CNN feature-extraction backbone was used whenever applicable, so that performance differences were mainly attributable to the subsequent fusion, sequence modeling, or graph-learning modules rather than to unequal front-end feature-extraction capacity.

We compare the bearing fault classification accuracy of MTMAGNet with five different models to evaluate its performance in intelligent diagnosis. The selected comparison models are: (1) Adaptive Graph Convolutional Network (AGCN [39]), which dynamically adjusts the graph structure based on data to handle non-static relationships, allowing us to compare AGCN’s adaptability with the multi-task learning and attention mechanisms in MTMAGNet; (2) Dual Graph Convolutional Network (Dual-GCN [40]), which processes data through two independent graph structures to capture different relationships, serving to evaluate the multi-task learning in MTMAGNet; (3) Graph Attention Network (GAT [41]), which introduces node-level attention mechanisms to focus on important neighboring nodes, enabling an assessment of modal attention versus node-level attention; (4) Convolutional Neural Network (CNN), where features from acoustic and vibration signals are separately extracted, concatenated, and classified through a fully connected layer to verify MTMAGNet’s superiority over traditional CNNs; and (5) Long Short-Term Memory Network (LSTM), which extracts sequential features suitable for capturing dependencies in acoustic and vibration signals, for comparison with MTMAGNet.

Figure 10 compares the fault classification accuracy of MTMAGNet and the five other methods across eight trials. The results indicate that MTMAGNet consistently achieved the highest classification accuracy, approaching or reaching 100%, outperforming all the other methods. Specifically, AGCN and Dual-GCN exhibited certain advantages in adjusting graph structures and facilitating multi-task learning; however, their accuracies were still lower than that of MTMAGNet, underscoring the effectiveness of the synergistic interaction between multimodal attention mechanisms and multi-task learning in the proposed approach. Compared to GAT, MTMAGNet demonstrated more precise attention allocation at the modality level, enhancing the utilization of features across different modalities. Furthermore, in contrast to traditional CNN and LSTM, MTMAGNet captured acoustic and vibration signal features and their interrelationships more effectively, resulting in higher diagnostic accuracy. In summary, the experimental results validate the superior performance of MTMAGNet in intelligent fault diagnosis, particularly in the areas of multimodal data fusion and optimized attention mechanisms.

Figure 11 summarizes the average classification accuracy of each method, with error bars indicating the accuracy fluctuation range. The results show that the MTMAGNet model achieves the highest average accuracy, approaching 100%, with the smallest error range, demonstrating exceptional stability and robustness. In comparison, AGCN, GAT, and Dual-GCN have slightly lower accuracies, indicating that the multi-modality attention mechanism and multi-task learning in MTMAGNet effectively enhance fault classification performance. Traditional methods like CNN and LSTM perform worse, further validating the superiority of MTMAGNet in intelligent diagnostic tasks.

The confusion matrices for six methods are shown in Figure 12. The results obtained from the figure are as follows: the MTMAGNet model achieves nearly perfect classification across all fault classes, with no significant misclassifications, highlighting its robustness and accuracy in fault diagnosis. AGCN and Dual-GCN perform relatively well but exhibit occasional misclassifications, particularly in classes 3 and 5. GAT demonstrates competitive performance but shows slightly lower accuracy in class 5. CNN and LSTM, traditional deep learning models, exhibit the highest misclassification rates, especially in classes 2, 3, and 5, indicating limitations in handling complex fault patterns. Overall, the MTMAGNet model surpasses other models in classification accuracy, demonstrating its effectiveness and reliability for intelligent fault diagnosis.

Table 8 presents the classification accuracy of the six methods under different noise levels. The results show that MTMAGNet achieves the highest accuracy across all noise levels, especially under low noise conditions (0.2 and 0.5), with accuracy approaching 100%, demonstrating its excellent noise resistance. As the noise level increases, the accuracy of MTMAGNet slightly decreases but still remains superior to the other comparative models, maintaining an accuracy of 91.61% even at a noise level of 2. In contrast, Dual-GCN and AGCN exhibit relatively good noise resistance but have significantly lower accuracy than MTMAGNet under high noise conditions (noise level 2). GAT and CNN perform reasonably well at moderate and low noise levels (0.5 and 1), but their accuracy declines markedly under higher noise conditions. LSTM performs the worst across all noise levels, with accuracy dropping significantly to 65.12% at high noise levels. The experimental results further indicate that the MTMAGNet model has high robustness and accuracy under various noise conditions.

To evaluate the classification accuracy of the proposed algorithm under small-sample conditions, we investigated its feature processing capability at different training sample sizes. The sample sizes of the training dataset were set to 10%, 30%, and 50%. Table 9 presents the experimental comparison results of six methods under different sample scales. The results in Table 9 show that MTMAGNet consistently outperforms other methods in classification performance across various training sample proportions, demonstrating exceptional small-sample handling capability. Its average accuracy remains at a high level and shows the least sensitivity to sample size variations. In contrast, graph neural network-based methods such as Dual-GCN and AGCN rank second, exhibiting strong stability and relatively high classification accuracy. However, traditional methods such as CNN and LSTM perform poorly in small-sample scenarios, particularly LSTM, which shows limited improvement in performance as the sample size increases.

Although the above results verify the superiority of MTMAGNet under the conventional experimental setting, further evaluation under different rotational-speed conditions is still necessary to assess its generalization ability in practical applications.

4.4. Generalization Analysis Under Unseen Rotational Speeds

To further verify the robustness and generalization capability of the proposed MTMAGNet under varying operating conditions, additional experiments were conducted at three different rotational speeds, namely 1800 rpm, 2400 rpm, and 3000 rpm. On this basis, two evaluation settings were considered. The first is the same-speed setting, in which the training and testing data were collected under the same rotational speed. The second is the cross-speed setting, in which the training and testing data were collected under different rotational speeds. Compared with the conventional evaluation protocol, the cross-speed setting is more challenging and can better reflect the practical application scenario in which the operating speed of the equipment may vary during deployment.

Table 10 presents the same-speed diagnostic performance of different methods at 1800 rpm, 2400 rpm, and 3000 rpm. It can be observed that all multimodal models achieve relatively high classification accuracy under the same-speed setting. Among them, MTMAGNet consistently exhibits the best performance at all three speeds, reaching 99.82% at 1800 rpm, 99.29% at 2400 rpm, and 100.00% at 3000 rpm, with an average accuracy of 99.70%. In contrast, CNNTransformer achieves an average accuracy of 96.61%, LateFusionCNN achieves 95.54%, and CNNBiLSTM achieves 91.79%. These results indicate that the proposed multimodal attention and graph-based learning framework maintains highly stable diagnostic capability under different fixed-speed conditions and outperforms the other compared multimodal baselines in terms of both accuracy and consistency.

From Figure 13, it is evident that MTMAGNet achieves the best confusion-matrix distribution among all compared methods. Its prediction results are concentrated almost entirely on the main diagonal, indicating excellent classification consistency. In contrast, the other methods show more obvious off-diagonal errors, especially CNNBiLSTM and CNNTransformer, which exhibit substantial confusion among several fault categories. The dominant confusion in MTMAGNet only occurs between Classes C2 and C3, whereas the competing methods show broader and more severe misclassification patterns. These observations demonstrate that the proposed MTMAGNet can more effectively preserve inter-class separability and reduce confusion between fault states with similar signal characteristics.

To further evaluate the model under conditions different from those used for training, cross-speed experiments were conducted, and the corresponding results are summarized in Table 11. In this setting, the model was trained using data from two rotational speeds and tested on the remaining unseen rotational speed. The results show that the classification performance of all methods decreases compared with the same-speed setting, which confirms that changes in rotational speed introduce a clear distribution shift and increase the difficulty of fault diagnosis. Nevertheless, MTMAGNet still achieves the best performance under all three cross-speed settings, obtaining 74.61%, 67.46%, and 75.36%, respectively, with an average accuracy of 72.48%. By comparison, CNNTransformer achieves an average accuracy of 57.62%, LateFusionCNN achieves 57.24%, and CNNBiLSTM achieves 53.56%. Therefore, although the unseen-speed setting is significantly more challenging than the same-speed setting, MTMAGNet still preserves the highest classification accuracy among all compared methods, which demonstrates its stronger generalization ability.
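The cross-speed protocol amounts to a leave-one-speed-out scheme, sketched below together with a check of the reported average. The function name `cross_speed_folds` is illustrative; the three accuracies are the MTMAGNet values quoted in the text.

```python
SPEEDS_RPM = [1800, 2400, 3000]


def cross_speed_folds(speeds):
    """Leave-one-speed-out folds: train on two speeds, test on the third."""
    return [
        ([s for s in speeds if s != held_out], held_out)
        for held_out in speeds
    ]


folds = cross_speed_folds(SPEEDS_RPM)

# MTMAGNet cross-speed accuracies (percent) as quoted in the text.
reported = [74.61, 67.46, 75.36]
mean_accuracy = round(sum(reported) / len(reported), 2)
```

The three folds give three unseen-speed test results per method, and averaging the quoted MTMAGNet values reproduces the reported 72.48%.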

From the detailed cross-speed results, it can also be found that testing on 2400 rpm is the most difficult setting for all methods. Under this condition, MTMAGNet still achieves 67.46% accuracy, whereas CNNTransformer, LateFusionCNN, and CNNBiLSTM achieve only 38.61%, 52.54%, and 40.79%, respectively. This result suggests that the feature distribution at 2400 rpm differs more substantially from that learned from the other two speeds, leading to more severe domain shift. Even under this challenging condition, the proposed MTMAGNet retains a clear performance advantage, which can be attributed to the synergistic effect of modality attention, graph-based relational modeling, and auxiliary modality supervision.

To provide a more detailed class-level interpretation, Table 12 reports the class-wise precision, recall, and F1-score of different methods under the representative 3000 rpm setting. Among all compared models, MTMAGNet achieves the most balanced class-level performance, with perfect precision, recall, and F1-score of 1.0000 for Classes C0, C1, C3, C4, C5, and C6, and a near-perfect F1-score of 0.9937 for Class C2. This indicates that the proposed method can stably identify both normal and faulty states under this condition. LateFusionCNN also performs well overall, but its recognition of Class C5 and Class C6 is slightly weaker, with F1-scores of 0.9809 and 0.9816, respectively. CNNBiLSTM shows more pronounced degradation in several classes, especially Class C2, for which the F1-score drops to 0.8944. CNNTransformer performs strongly on most categories, but its performance on Class C1 is clearly lower, with an F1-score of 0.8414, indicating limited robustness for this fault type.
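Class-wise precision, recall, and F1-score follow directly from a confusion matrix, as sketched below. The 2 × 2 matrix is a toy example and is unrelated to the data in Table 12; the function name `class_metrics` is illustrative.

```python
def class_metrics(confusion):
    """Per-class (precision, recall, F1) from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed samples of class c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # others predicted as c
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


# Toy two-class confusion matrix, for illustration only.
toy_metrics = class_metrics([[8, 2], [1, 9]])
```

A class reaches an F1-score of 1.0000 only when it has neither missed samples nor samples of other classes assigned to it, which is why a single confused pair lowers the scores of both classes involved.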

The most confused class pairs are summarized in Table 13. For MTMAGNet, the dominant confusion occurs between Class C2 and Class C3, with only two samples of Class C2 being misclassified as Class C3 and one sample of Class C3 being misclassified as Class C2, which indicates that the overall confusion level of the proposed method remains very low. By contrast, the competing methods exhibit more severe class confusion. For example, in CNNTransformer, Class C1 is frequently misclassified as Class C2, with 13 confused samples, and as Class C6, with 6 confused samples. In CNNBiLSTM, the most prominent confusion occurs between Class C5 and Class C2, with 6 misclassified samples. These results suggest that the competing methods are more likely to confuse fault categories with similar impulsive characteristics, whereas MTMAGNet can better preserve inter-class separability.
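The most confused class pairs can be extracted from a confusion matrix by ranking its off-diagonal entries, as in the sketch below. The 3 × 3 matrix is a toy example rather than data from Table 13, and the function name `top_confusions` is illustrative.

```python
def top_confusions(confusion, top=3):
    """Largest off-diagonal entries as (true_class, predicted_class, count)."""
    pairs = [
        (count, true, pred)
        for true, row in enumerate(confusion)
        for pred, count in enumerate(row)
        if true != pred and count > 0
    ]
    # Sort by descending count, breaking ties by class indices.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))
    return [(true, pred, count) for count, true, pred in pairs[:top]]


# Toy three-class confusion matrix, for illustration only.
toy_pairs = top_confusions([[10, 0, 0], [0, 8, 2], [0, 1, 9]])
```

Keeping the direction of each confusion, rather than merging the two directions of a pair, shows whether errors are symmetric between two classes or flow mainly one way.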

Overall, the above results demonstrate that MTMAGNet not only achieves the highest accuracy under the same-speed setting, but also maintains the best diagnostic performance under the more challenging cross-speed setting. Meanwhile, the class-wise analysis and confusion-pair analysis further confirm that the proposed method provides more balanced and reliable discrimination among different bearing states. Therefore, the proposed MTMAGNet exhibits both superior classification performance and stronger generalization capability than the other multimodal baseline methods.

5. Conclusions

This paper proposed a Multi-Task Multimodal Attention Graph Convolutional Network (MTMAGNet) for acoustic–vibration fusion-based rolling bearing fault diagnosis. By integrating modality-level attention, graph-based relational learning, and an auxiliary modality-classification task, the proposed method effectively exploits complementary multimodal information and improves fault-representation learning. Experimental results demonstrated that MTMAGNet outperformed the compared methods under multiple evaluation settings. In the same-speed experiments, it achieved the best performance at 1800 rpm, 2400 rpm, and 3000 rpm, with an average accuracy of 99.70%. Under the more challenging cross-speed setting, MTMAGNet still obtained the highest average accuracy of 72.48%, indicating stronger generalization ability than the compared multimodal baselines. The ablation, class-wise, and confusion-matrix results further verified the effectiveness of modality attention, graph learning, and auxiliary modality supervision in improving diagnostic performance.

The noise analysis showed that MTMAGNet is robust under low-to-moderate noise levels, but its performance degrades under severe noise. In addition, although the sensitivity analysis indicated that k = 5 provided the best trade-off under the current setting, the proposed framework still depends on the graph-construction strategy. Compared with simpler models, MTMAGNet also introduces additional computational cost due to multimodal attention, dynamic graph construction, and multi-task optimization. Future work will therefore focus on more adaptive graph construction, more efficient lightweight implementation, and broader validation on other rotating machinery and fault types.
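
To make the dependence on the graph-construction strategy concrete, the sketch below builds a k-nearest-neighbor graph over a set of feature vectors. It is an illustration under stated assumptions rather than the authors' implementation: Euclidean distance, symmetrization of the edges, and the function name `knn_graph` are all assumptions, and the toy one-dimensional features stand in for the fused 128-dimensional representations.

```python
import math
from typing import List, Sequence, Set, Tuple

def knn_graph(features: Sequence[Sequence[float]], k: int) -> Set[Tuple[int, int]]:
    """Return undirected edges linking each sample to its k nearest neighbors.

    Assumes Euclidean distance; an edge is kept if either endpoint selects the other.
    """
    n = len(features)
    edges: Set[Tuple[int, int]] = set()
    for i in range(n):
        # Distances from sample i to every other sample, nearest first
        dists: List[Tuple[float, int]] = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Toy 1-D features forming two well-separated groups (hypothetical values)
points = [[0.0], [1.0], [2.5], [10.0], [11.0]]
edges = knn_graph(points, k=1)
print(sorted(edges))
```

A larger k adds edges between less similar samples, which may explain why accuracy in the sensitivity analysis does not increase monotonically with k. In a dynamic setting the graph would be rebuilt from the current features at each training step.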

Author Contributions

Conceptualization, T.W. and Y.T.; methodology, T.W.; software, T.W.; validation, T.W., Y.H. and Y.L.; formal analysis, T.W.; investigation, Y.T.; resources, Y.T. and T.W.; data curation, T.W.; writing—original draft preparation, T.W.; writing—review and editing, T.W., Y.H. and Y.L.; visualization, Y.T.; supervision, Y.T. and T.W.; project administration, Y.T.; fund. All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by the Liaoning Province Doctoral Research Startup Fund Project (No. 2025-BS-0312).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The acoustic and vibration datasets used in this study, including the raw recorded signals and the processed 1024-sample windows, will be made publicly available upon publication of this article.

Acknowledgments

We gratefully acknowledge the financial support from the Liaoning Province Doctoral Research Startup Fund Project.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The construction and iteration process of GCN.

Figure 2. The main framework of MTMAGNet.

Figure 3. The experimental test device.

Figure 4. The time-domain signals of 7 different fault types.

Figure 5. Results of different ablation methods.

Figure 6. Classification accuracy of MTMAGNet under different weights.

Figure 7. The distribution of attention weights by MTMAGNet.

Figure 8. The classification results of MTMAGNet under different noise levels.

Figure 9. Visualization results of the MTMAGNet method at different iteration stages.

Figure 10. The comparison results of different methods under each experiment.

Figure 11. The average classification accuracy of each method.

Figure 12. Comparison of the confusion matrices of different methods.

Figure 13. Confusion matrices of different methods under the same-speed setting at 3000 rpm.

Table 1. Structural parameters of rolling bearings.

Structural Parameters	Parameter Values	Structural Parameters	Parameter Values
Bearing type	MB ER-8K	Contact angle	0°
Inside diameter	0.91 in	The number of rollers	8
Pitch diameter	1.32 in	Roller diameter	0.312 in
Sound array plate	BSWA MPA416	Sound collector	PAK MK II-SC42
Vibration collector	BSZ800D-16

Table 2. Setup of the fault dataset of bearings.

Bearing Condition	Working Speed (r/min)	Acoustic Training/Testing Samples	Vibration Training/Testing Samples	Label
Two normal bearings	1800/2400/3000	400/100	400/100	C0
Outer race failure (near the sensor), one normal	1800/2400/3000	400/100	400/100	C1
Inner race failure (near the sensor), one normal	1800/2400/3000	400/100	400/100	C2
Ball failure (near the sensor), one normal	1800/2400/3000	400/100	400/100	C3
Compound failure of inner and outer races (near the sensor), one normal	1800/2400/3000	400/100	400/100	C4
Outer race failure (near the sensor), inner race failure	1800/2400/3000	400/100	400/100	C5
Outer race failure, inner race failure (near the sensor)	1800/2400/3000	400/100	400/100	C6

Table 3. Structure and parameter settings of the proposed method.

Module	Layer	Type	Kernel Size	Input Size	Output Size	Operation Details
Acoustic Feature Extractor	conv1	Conv1d	5	16	64	BatchNorm1d + ReLU + MaxPool1d(2)
	conv2	Conv1d	5	64	128	BatchNorm1d + ReLU + MaxPool1d(2)
	conv3	Conv1d	3	128	256	BatchNorm1d + ReLU + AdaptiveAvgPool1d
	dropout	Dropout	-	-	-	p = 0.5
	fc	Linear	-	256	128	-
Vibration Feature Extractor	conv1	Conv1d	5	1	64	BatchNorm1d + ReLU + MaxPool1d(2)
	conv2	Conv1d	5	64	128	BatchNorm1d + ReLU + MaxPool1d(2)
	conv3	Conv1d	3	128	256	BatchNorm1d + ReLU + AdaptiveAvgPool1d
	dropout	Dropout	-	-	-	p = 0.5
	fc	Linear	-	256	128	-
Modality Attention	(W_a)	Linear	-	128	64	tanh
	(W_v)	Linear	-	128	64	tanh
	(w)	Linear	-	64	1	-
Graph Layer	gcn1	GCNConv	-	128	64	ReLU
	gcn2	GCNConv	-	64	7	logits
Modal Classifier	fc	Linear	-	128	2	Softmax
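
The Modality Attention rows of Table 3 (two 128-to-64 projections with tanh, followed by a 64-to-1 scoring layer) can be read as a two-way attention over the acoustic and vibration features. The sketch below shows one plausible form in plain Python. It is an assumption, not the paper's exact formulation: the softmax over the two scores, the weighted sum used for fusion, the omission of bias terms, and the random weights and inputs are all hypothetical.

```python
import math
import random

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector; bias omitted for brevity."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def modality_attention(f_a, f_v, W_a, W_v, w):
    """Fuse acoustic (f_a) and vibration (f_v) features with softmax modality weights."""
    h_a = [math.tanh(z) for z in linear(W_a, f_a)]  # 128 -> 64, tanh
    h_v = [math.tanh(z) for z in linear(W_v, f_v)]  # 128 -> 64, tanh
    s_a = sum(wi * hi for wi, hi in zip(w, h_a))    # 64 -> 1 score
    s_v = sum(wi * hi for wi, hi in zip(w, h_v))
    m = max(s_a, s_v)                               # subtract max for numerical stability
    e_a, e_v = math.exp(s_a - m), math.exp(s_v - m)
    alpha_a, alpha_v = e_a / (e_a + e_v), e_v / (e_a + e_v)
    # Assumed fusion rule: attention-weighted sum, which keeps the 128-dim size
    fused = [alpha_a * a + alpha_v * v for a, v in zip(f_a, f_v)]
    return fused, (alpha_a, alpha_v)

random.seed(0)
D_IN, D_HID = 128, 64  # sizes taken from Table 3

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

W_a, W_v = rand_matrix(D_HID, D_IN), rand_matrix(D_HID, D_IN)
w = [random.uniform(-0.1, 0.1) for _ in range(D_HID)]
f_a = [random.gauss(0.0, 1.0) for _ in range(D_IN)]
f_v = [random.gauss(0.0, 1.0) for _ in range(D_IN)]

fused, (alpha_a, alpha_v) = modality_attention(f_a, f_v, W_a, W_v, w)
print(len(fused), round(alpha_a + alpha_v, 6))
```

Under this reading the fused vector keeps 128 dimensions, which is consistent with the 128-dimensional input of the first graph layer in Table 3.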

Table 4. The hyper-parameters of the proposed model.

Symbols	Parameter	Symbols	Parameter
batch_size	64	learning rate	0.0001
optimizer	Adam	weight_decay	1 × 10⁻⁴
num_epochs	100	step_size	50
gamma	0.5	lambda_class	1.0
lambda_modality	0.4	n_neighbors	5
window_size	1024	num_samples_per_file	400
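
Two of the settings in Table 4 interact in ways that a short calculation makes explicit. The sketch below assumes that `step_size` and `gamma` parametrize a step-decay learning-rate schedule (as in common deep learning frameworks) and that `lambda_class` and `lambda_modality` weight a sum of the two task losses. Both readings are assumptions based on the parameter names, and the function names are illustrative.

```python
# Values copied from Table 4
BASE_LR = 1e-4
STEP_SIZE = 50
GAMMA = 0.5
NUM_EPOCHS = 100
LAMBDA_CLASS = 1.0
LAMBDA_MODALITY = 0.4

def step_lr(epoch: int) -> float:
    """Learning rate at a 0-based epoch under step decay (assumed schedule)."""
    return BASE_LR * GAMMA ** (epoch // STEP_SIZE)

def total_loss(loss_class: float, loss_modality: float) -> float:
    """Weighted sum of the fault-classification and modality-classification losses."""
    return LAMBDA_CLASS * loss_class + LAMBDA_MODALITY * loss_modality

for epoch in (0, 49, 50, NUM_EPOCHS - 1):
    print(f"epoch {epoch:3d}: lr = {step_lr(epoch):.1e}")

# Hypothetical loss values for illustration only
print(total_loss(0.5, 0.25))
```

With 100 epochs and a step size of 50, the learning rate is halved exactly once, at the midpoint of training.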

Table 5. Sensitivity analysis of the number of neighbors k.

k	Average Accuracy (%)
3	99.13
5	99.98
7	99.01
9	99.14
11	98.97

Table 6. Results of 8 repeated ablation experiments.

Diagnosis Methods	Max-acc	Min-acc	Average-acc
MTMAGNet	100.0%	99.11%	99.78 ± 0.31%
MMA-GCN	97.46%	95.14%	96.55 ± 0.41%
MTL-GCN	95.93%	94.79%	94.90 ± 0.79%
VA-GCN	96.46%	93.14%	94.35 ± 0.81%
OV-GCN	92.75%	90.71%	91.81 ± 0.94%
OA-GCN	90.57%	81.61%	86.16 ± 1.39%

Table 7. Top-performing loss-weight combinations from the grid search.

Rank	lambda_Class	lambda_Modality	Average Accuracy (%)	Macro-F1 (%)
1	0.8	0.6	99.50	99.47
2	1.0	0.6	99.40	99.36
3	0.8	0.4	99.30	99.26
4	0.6	0.6	99.20	99.18
5	1.0	0.4	99.10	99.07
6	0.6	0.4	99.00	98.96
7	0.8	0.8	98.90	98.86
8	0.4	0.6	98.80	98.75
9	0.6	0.8	98.70	98.66
10	0.4	0.4	98.60	98.55

Table 8. Comparative analysis results of different methods under different noise levels.

Model	Noise Level 2	Noise Level 1.5	Noise Level 1	Noise Level 0.5	Noise Level 0.2
MTMAGNet	91.61%	96.43%	99.29%	99.28%	99.64%
Dual-GCN	89.46%	91.07%	97.32%	96.79%	98.93%
AGCN	84.82%	89.28%	96.07%	97.86%	98.57%
GAT	83.39%	87.51%	97.50%	97.32%	97.50%
CNN	79.11%	87.33%	93.04%	95.41%	95.97%
LSTM	65.12%	66.12%	80.03%	83.36%	83.86%

Table 10. Same-speed diagnostic performance of different methods at 1800 rpm, 2400 rpm, and 3000 rpm.

Method	1800 rpm (%)	2400 rpm (%)	3000 rpm (%)	Avg (%)
MTMAGNet	99.82	99.29	100.00	99.70
CNNTransformer	91.61	98.75	99.46	96.61
LateFusionCNN	91.61	95.71	99.29	95.54
CNNBiLSTM	83.04	97.68	94.64	91.79

Table 11. Cross-speed generalization performance of different methods under unseen rotational speeds.

Method	Train 2400 + 3000, Test 1800 (%)	Train 1800 + 3000, Test 2400 (%)	Train 1800 + 2400, Test 3000 (%)	Avg (%)
MTMAGNet	74.61	67.46	75.36	72.48
CNNTransformer	74.36	38.61	59.89	57.62
LateFusionCNN	57.46	52.54	61.71	57.24
CNNBiLSTM	67.75	40.79	52.14	53.56

Table 12. Class-wise precision, recall, and F1-score of different methods at 3000 rpm.

Method	Class	Precision	Recall	F1	Support
LateFusionCNN	C0	1.0000	1.0000	1.0000	80
LateFusionCNN	C1	1.0000	1.0000	1.0000	80
LateFusionCNN	C2	1.0000	0.9875	0.9937	80
LateFusionCNN	C3	0.9877	1.0000	0.9938	80
LateFusionCNN	C4	1.0000	1.0000	1.0000	80
LateFusionCNN	C5	1.0000	0.9625	0.9809	80
LateFusionCNN	C6	0.9639	1.0000	0.9816	80
CNNBiLSTM	C0	1.0000	1.0000	1.0000	80
CNNBiLSTM	C1	0.9375	0.9375	0.9375	80
CNNBiLSTM	C2	0.8889	0.9000	0.8944	80
CNNBiLSTM	C3	0.9756	1.0000	0.9877	80
CNNBiLSTM	C4	0.9398	0.9750	0.9571	80
CNNBiLSTM	C5	1.0000	0.9250	0.9610	80
CNNBiLSTM	C6	0.9625	0.9625	0.9625	80
MTMAGNet	C0	1.0000	1.0000	1.0000	80
MTMAGNet	C1	1.0000	1.0000	1.0000	80
MTMAGNet	C2	0.9938	1.0000	0.9937	80
MTMAGNet	C3	1.0000	1.0000	1.0000	80
MTMAGNet	C4	1.0000	1.0000	1.0000	80
MTMAGNet	C5	1.0000	1.0000	1.0000	80
MTMAGNet	C6	1.0000	1.0000	1.0000	80
CNNTransformer	C0	1.0000	1.0000	1.0000	80
CNNTransformer	C1	0.9385	0.7625	0.8414	80
CNNTransformer	C2	0.8352	0.9500	0.8889	80
CNNTransformer	C3	0.9512	0.9750	0.9630	80
CNNTransformer	C4	0.9877	1.0000	0.9938	80
CNNTransformer	C5	1.0000	0.9875	0.9937	80
CNNTransformer	C6	0.9268	0.9500	0.9383	80
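
The class-wise metrics in Table 12 follow the standard definitions of precision, recall, and F1-score. As a worked check, the sketch below recomputes the CNNTransformer row for Class C1. The counts are back-calculated rather than reported: a recall of 0.7625 on 80 samples implies 61 correct predictions and 19 misses, and a precision of 0.9385 implies 4 samples from other classes predicted as C1. These counts agree with the C1 and C6 rows for CNNTransformer in Table 13. The function name is illustrative.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Per-class precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# CNNTransformer, Class C1 at 3000 rpm (counts inferred from Tables 12 and 13)
tp, fp, fn = 61, 4, 19
p, r, f = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.4f} recall={r:.4f} f1={f:.4f}")
```

Because the F1-score is the harmonic mean of precision and recall, the low recall for this class pulls F1 well below the precision value.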

Table 13. Most confused class pairs of different methods at 3000 rpm.

Method	True Class	Predicted Class	Count	Rate
LateFusionCNN	C5	C6	3	0.0375
LateFusionCNN	C2	C3	1	0.0125
CNNBiLSTM	C5	C2	6	0.0750
CNNBiLSTM	C1	C4	3	0.0375
CNNBiLSTM	C2	C1	2	0.0250
CNNBiLSTM	C2	C3	2	0.0250
CNNBiLSTM	C2	C4	2	0.0250
CNNBiLSTM	C2	C6	2	0.0250
CNNBiLSTM	C4	C1	2	0.0250
CNNBiLSTM	C6	C2	2	0.0250
CNNBiLSTM	C1	C2	1	0.0125
CNNBiLSTM	C1	C6	1	0.0125
MTMAGNet	C2	C3	2	0.0250
MTMAGNet	C3	C2	1	0.0125
CNNTransformer	C1	C2	13	0.1625
CNNTransformer	C1	C6	6	0.0750
CNNTransformer	C2	C3	4	0.0500
CNNTransformer	C6	C1	4	0.0500
CNNTransformer	C3	C2	2	0.0250
CNNTransformer	C5	C4	1	0.0125

Table 9. Analysis results under different datasets.

Scale	Method	1st	2nd	3rd	4th	5th	Average
10%	MTMAGNet	98.21	96.43	98.23	99.91	94.64	97.48
	Dual-GCN	95.43	93.41	89.29	89.29	91.07	91.70
	AGCN	79.00	92.86	92.81	87.50	85.71	87.58
	GAT	92.86	89.29	78.57	89.29	87.50	87.50
	CNN	91.07	87.05	86.39	83.93	83.36	86.36
	LSTM	78.01	74.39	64.29	70.18	77.86	72.95
30%	MTMAGNet	99.99	99.40	99.98	99.98	100.00	99.87
	Dual-GCN	95.62	95.62	96.21	96.21	97.40	96.21
	AGCN	96.81	94.43	93.83	95.62	97.40	95.50
	GAT	96.21	97.01	93.83	96.81	96.81	96.13
	CNN	92.02	92.05	91.43	92.02	95.31	92.56
	LSTM	80.09	78.31	78.36	77.93	76.69	78.27
50%	MTMAGNet	99.98	100.00	98.91	99.21	99.64	99.55
	Dual-GCN	96.79	97.86	97.93	96.97	98.57	97.62
	AGCN	97.86	98.93	97.86	90.64	98.64	96.79
	GAT	96.43	93.21	98.21	97.86	97.14	96.57
	CNN	95.57	93.57	95.21	92.50	94.86	94.34
	LSTM	81.17	80.28	83.36	81.97	81.16	81.58

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
