Article

Rail Surface Defect Diagnosis Based on Image–Vibration Multimodal Data Fusion

by Zhongmei Wang *,†, Shenao Peng, Wenxiu Ao, Jianhua Liu and Changfan Zhang
College of Railway Transportation, Hunan University of Technology, Zhuzhou 412007, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(5), 127; https://doi.org/10.3390/bdcc9050127
Submission received: 17 April 2025 / Revised: 7 May 2025 / Accepted: 9 May 2025 / Published: 12 May 2025

Abstract

To address the challenges in existing multi-sensor data fusion methods for rail surface defect diagnosis, particularly their limitations in fully exploiting potential synergistic information among multimodal data and effectively bridging the semantic gap between heterogeneous multi-source data, this paper proposes a diagnostic approach based on a Progressive Joint Representation Graph Attention Fusion Network (PJR-GAFN). The methodology comprises five principal phases: Firstly, shared and specific autoencoders are used to extract joint representations of multimodal features through shared and modality-specific representations. Secondly, a squeeze-and-excitation module is implemented to amplify defect-related features while suppressing non-essential characteristics. Thirdly, a progressive fusion module is introduced to comprehensively utilize cross-modal synergistic information during feature extraction. Fourthly, a source domain classifier and domain discriminator are employed to capture modality-invariant features across different modalities. Finally, the spatial attention aggregation properties of graph attention networks are leveraged to fuse multimodal features, thereby fully exploiting contextual semantic information. Experimental results on real-world rail surface defect datasets from domestic railway lines demonstrate that the proposed method achieves 95% diagnostic accuracy, confirming its effectiveness in rail surface defect detection.

1. Introduction

Rail transportation is one of the most crucial forms of land transportation. With the steady development of railway infrastructure in China, the density of train operations, total mileage, and transport capacity have steadily increased [1]. As a critical component of railway infrastructure, the steel rail plays a major role in ensuring train safety. Due to continuous contact with train wheels and exposure to environmental factors, rail surfaces endure significant mechanical stress and are highly susceptible to damage and defect formation. Among these defects, surface spalling and flaking are particularly common, often attributable to contact fatigue or overloads caused by time-varying excitation forces [2]. If defects are not detected and repaired in a timely manner, they may lead to significant safety accidents [3,4]. Therefore, the development of intelligent and efficient diagnostic methods for rail surface defects is both necessary and urgent. With the rapid development of industrial big data, approaches based on multimodal data fusion for rail surface defect diagnosis have garnered increasing attention and recognition. However, the multi-source heterogeneity inherent in multimodal data poses challenges for effective data fusion [5,6]. Overcoming the semantic gap between heterogeneous multi-source data and fully capturing the redundancy and complementarity between multimodal data are critical to the task of multimodal data fusion for rail surface defect diagnosis.
In recent years, deep learning has emerged as a leading approach in artificial intelligence, owing to its powerful autonomous feature learning capabilities. By adjusting network weights based on large-scale raw input, deep neural networks are capable of capturing complex patterns and relationships among data [7,8]. Leveraging these strengths, a number of deep learning-based multimodal data fusion methods for rail surface defect diagnosis have been developed. Chen et al. [6] proposed a multi-source heterogeneous data fusion network for rail surface defect diagnosis by integrating camera images and B-mode ultrasound images. Ge et al. [9] introduced a method based on visual–lidar decision fusion, which efficiently improved diagnostic accuracy through pre-detection and post-detection during the decision-making stage. Yang et al. [10] proposed a bidirectional feature alignment approach that explores the internal consistency of multimodal information from both profile and semantic perspectives, improving the fusion quality of multimodal data. Wang et al. [11] developed an RGB-D-based method targeting service-free rail surface defect diagnosis, in which depth images provide spatial information complementary to RGB images, and proposed a multimodal attention module with a novel cross-modal fusion strategy to highlight complex defect features. Piao et al. [12] designed a depth-induced multi-scale recurrent attention network based on RGB and depth images. The model incorporates a depth refinement module that effectively combines spatial depth cues with multi-scale contextual features. Zhao et al. [13] proposed a bimodal diagnostic network that integrates 2D and 3D rail surface information to detect top-surface abrasion. Liu et al. [14] combined self-attention mechanisms and other attention mechanisms to fuse attention information obtained from RGB and depth data, effectively propagating long-range semantic dependencies and efficiently combining multimodal information. Yang et al. [15] introduced a method that fuses one-dimensional vibration signals and two-dimensional images, extracting vibration features with sparse autoencoders and employing a support vector classifier for diagnosis. Shen et al. [5] developed a model using both vibration and image data, incorporating an attention mechanism to emphasize features at different scales and achieve dynamic weighting during the fusion process. Despite the promising results of existing deep learning-based multimodal data fusion methods for rail surface defect diagnosis, these approaches typically rely on independent feature extractors for each modality during the feature extraction stage. They fail to directly establish effective associations between the raw multimodal inputs, resulting in the underutilization of potential synergistic information. This limitation poses a latent risk to the efficiency and robustness of data fusion.
To directly establish effective associations between the raw multimodal input data, fully exploit the potential synergistic information across modalities, and effectively bridge the semantic gap between multi-source heterogeneous data, thereby achieving efficient data fusion, this paper proposes a Progressive Joint Representation Graph Attention Fusion Network (PJR-GAFN) for multimodal data fusion in rail surface defect diagnosis. The proposed method is experimentally validated on a real-world dataset collected from a domestic railway line, which includes both image and vibration data. Experimental results show that the model achieves high diagnostic accuracy for rail surface defect diagnosis. The main contributions of the proposed method are summarized as follows:
1. A joint domain separation representation module is integrated within a domain-adversarial training framework for multimodal feature extraction. This module enables direct associations between raw multimodal inputs and effectively bridges the semantic gap between heterogeneous sources.
2. A progressive fusion module is introduced during the multimodal feature extraction stage, enabling the model to fully exploit latent synergistic information across modalities.
3. A graph attention network is incorporated into the multimodal feature fusion module, allowing the model to efficiently leverage the inherent dependencies between multimodal features at similar time steps, thus achieving efficient fusion of semantic information across different time steps.

2. Model Overview

To effectively bridge the semantic gap between multi-source heterogeneous data, this paper proposes a joint domain separation representation network for feature extraction, which integrates Convolutional Autoencoders (CAEs) and a squeeze-and-excitation (SE) block within a domain-adversarial training framework. To further exploit the latent synergistic information among modalities, a Progressive Fusion Module (PFM) is introduced. In addition, a source domain classifier and domain discriminator are employed to capture modality-invariant features across different modalities. The extracted multimodal features are subsequently transformed into an undirected graph data structure and processed by a Graph Attention Network (GAT) to generate a multimodal fused representation. This representation is then input into a decision network for rail surface defect diagnosis. The subsequent sections detail the proposed Progressive Joint Representation Graph Attention Fusion Network (PJR-GAFN) from two perspectives: model architecture and learning strategy.

2.1. Model Structure

The Progressive Joint Representation Graph Attention Fusion Network (PJR-GAFN) comprises a shared autoencoder, modality-specific autoencoders, a Squeeze-and-Excitation (SE) block, a Progressive Fusion Module (PFM), a source domain classifier, a domain discriminator, and a Graph Attention Network (GAT). The overall architecture is illustrated in Figure 1, and the detailed representation process is described as follows.

2.1.1. Joint Domain Separation Representation

This paper employs a joint domain separation representation network, consisting of a shared autoencoder and modality-specific autoencoders, to learn the joint representation of modality-shared and modality-specific representations in the feature subspace of multimodal data. The modality-shared representation is designed to reduce the semantic distance between multimodal features, thereby alleviating the complexity of subsequent fusion tasks. In contrast, the modality-specific representation captures unique features from each modality, providing complementary information. The integration of both types of representations enables the model to fully exploit cross-modal complementarity, thereby effectively bridging the semantic gap inherent in heterogeneous multi-source data. Firstly, the original image data $x^p$ and vibration data $x^v$ are preprocessed through image enhancement and normalization. The resulting data are then input into the shared autoencoder $\mathrm{CAE}_s$ to extract modality-shared representations for both inputs. The corresponding process is described as follows:
$$h_s^m = \mathrm{CAE}_s(x^m; \theta_s), \quad m \in \{p, v\}$$
$$Y_s^m = \mathrm{CAE}_s(h_s^m; \theta_s), \quad m \in \{p, v\}$$
where $h_s^m$, $Y_s^m$, and $\theta_s$ are the shared representation, reconstructed representation, and network parameters of the shared autoencoder for the mth modality, respectively.
Next, the preprocessed image and vibration data are input into the specific autoencoder $\mathrm{CAE}_p^m$ to extract modality-specific representations, as detailed below:
$$h_p^m = \mathrm{CAE}_p^m(x^m; \theta_p^m), \quad m \in \{p, v\}$$
$$Y_p^m = \mathrm{CAE}_p^m(h_p^m; \theta_p^m), \quad m \in \{p, v\}$$
where $h_p^m$, $Y_p^m$, and $\theta_p^m$ are the modality-specific representation, reconstructed representation, and network parameters of the modality-specific autoencoder for the mth modality, respectively.
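The exact layer configurations of these autoencoders are reported in Figure 6 rather than in the text, so the following PyTorch sketch only illustrates the encoder/decoder pattern behind the four equations above; the channel counts, kernel sizes, and input shapes are placeholder assumptions.

```python
import torch
import torch.nn as nn

class CAE(nn.Module):
    """Minimal convolutional autoencoder: the encoder yields the representation h,
    and the decoder produces the reconstruction Y used by the reconstruction loss."""
    def __init__(self, in_ch: int = 1, feat_ch: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat_ch, in_ch, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        h = self.encoder(x)    # shared or specific representation
        y = self.decoder(h)    # reconstruction of the input
        return h, y

# One shared CAE serves both modalities; each modality also has its own CAE.
shared_cae = CAE()
specific_cae = {"p": CAE(), "v": CAE()}

x_p = torch.randn(8, 1, 64, 64)   # preprocessed image batch (placeholder shape)
x_v = torch.randn(8, 1, 64, 64)   # vibration data expanded to 2D (see Sec. 3.2)

h_s_p, y_s_p = shared_cae(x_p)           # modality-shared representation of images
h_sp_p, y_sp_p = specific_cae["p"](x_p)  # modality-specific representation of images
```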
Furthermore, since rail surface monitoring data exhibit significant variations near defect occurrences compared to normal conditions, focusing on these anomalous periods enables more accurate capture of defect-related features and enhances subsequent diagnostic performance. The Squeeze-and-Excitation (SE) block can adaptively recalibrate channel-wise feature responses, allowing the network to emphasize features that are most relevant to the downstream task [16]. Therefore, this paper incorporates the SE block to amplify defect-related features while suppressing irrelevant information. The corresponding representation process is described as follows:
$$h_{s,se}^m = \mathrm{SE}^m(h_s^m; \theta_{se}^m), \quad m \in \{p, v\}$$
$$h_{p,se}^m = \mathrm{SE}^m(h_p^m; \theta_{se}^m), \quad m \in \{p, v\}$$
where $h_{s,se}^m$, $h_{p,se}^m$, and $\theta_{se}^m$ are the weighted shared representation, weighted specific representation, and network parameters of the SE block for the mth modality, respectively.
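A minimal sketch of the SE block, following the channel-attention design of Hu et al. [16]: global average pooling squeezes each channel to a scalar, a two-layer bottleneck produces per-channel weights in (0, 1), and the input is re-scaled channel-wise. The reduction ratio below is an assumed value, as the paper defers exact parameters to Figure 6.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel recalibration (Hu et al. [16])."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)   # "squeeze": global average pool
        self.excite = nn.Sequential(             # "excitation": bottleneck MLP
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                        # channel weights in (0, 1)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = h.shape
        w = self.squeeze(h).view(b, c)           # per-channel summary statistics
        w = self.excite(w).view(b, c, 1, 1)
        return h * w                             # amplify/suppress channels

# Applied separately to the shared and specific representations of each modality.
h = torch.randn(8, 16, 32, 32)
h_weighted = SEBlock(16)(h)
```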
Based on the above, the joint representations of image features $h^p$ and vibration features $h^v$ are obtained as follows:
$$h^p = h_{s,se}^p + h_{p,se}^p$$
$$h^v = h_{s,se}^v + h_{p,se}^v$$

2.1.2. Domain-Adversarial Learning Representation

The essence of domain-adversarial learning lies in reducing the distributional differences between source and target domain data through the source domain classifier and domain discriminator, thereby capturing the modality-invariance between multimodal data. The specific representation process is as follows:
Firstly, the shared image representation $h_{s,se}^p$ obtained after the SE block is input into the source domain classifier $C$, which then calculates the source domain loss function input $p$, as shown below:
$$p = C(h_{s,se}^p; \theta_c)$$
where $C$ is the source domain classifier and $\theta_c$ represents its network parameters.
Next, the shared representation $h_s^m$ of the mth modality after the SE block is passed through the Gradient Reversal Layer (GRL) to obtain the gradient-reversed representation $q^m$, as follows:
$$q^m = \mathrm{GRL}(h_s^m; \theta_{grl}), \quad m \in \{p, v\}$$
where GRL is the gradient reversal layer, which outputs the same value as the input but with a reversed gradient direction, and $\theta_{grl}$ represents the network parameters of the GRL.
Finally, the gradient-reversed representation $q^m$ is input into the domain discriminator $D$ to compute the domain loss function input $d$, defined as follows:
$$d = \{ D(q^p; \theta_d), D(q^v; \theta_d) \}$$
where $D$ is the domain discriminator and $\theta_d$ represents its network parameters.
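The gradient reversal layer is standard in domain-adversarial training: it is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass, so the shared encoder is trained to confuse the domain discriminator. A minimal PyTorch sketch, with the scaling coefficient lam as an assumed hyperparameter not stated in the paper:

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient for lam itself

def grl(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # q_m = GRL(h_s_m): the same values flow forward to the discriminator D,
    # but the reversed gradient trains the shared encoder to confuse D.
    return GradReverse.apply(x, lam)
```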

2.1.3. Progressive Fusion Representation

Considering that feature extractors often exhibit deep structures and a large number of parameters, the core idea of the proposed Progressive Fusion Module is to feed the joint feature representations generated in the previous training round, together with the raw input data for the current round, into the feature extractor. This approach introduces cross-modal information into the early stages of feature extraction, enabling the model to fully leverage latent synergistic information across modalities during the extraction process. Suppose the current round is the ith round of the model training process. The specific representation process is defined as follows:
$$h_s^m(i) = \mathrm{CAE}_s\big(x^m, h_s^m(i-1), h_p^m(i-1); \theta_s\big), \quad m \in \{p, v\}$$
$$h_p^m(i) = \mathrm{CAE}_p\big(x^m, h_s^m(i-1), h_p^m(i-1); \theta_p\big), \quad m \in \{p, v\}$$
where $h_s^m(i)$ is the shared representation of the mth modality in the ith training round, produced by the shared autoencoder; $h_s^m(i-1)$ is the SE-weighted shared representation of the mth modality from the $(i-1)$-th round; $h_p^m(i)$ is the modality-specific representation of the mth modality in the ith round, generated by the specific autoencoder; and $h_p^m(i-1)$ is the SE-weighted specific representation of the mth modality from the $(i-1)$-th round. Furthermore, it should be noted that if the current round is the first round of training, both $h_s^m(0)$ and $h_p^m(0)$ are initialized to 0.
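The paper does not spell out how the previous round's representations are combined with the current raw input beyond the two update equations above. Channel-wise concatenation is one plausible reading, sketched below under the assumption that the stored round-(i-1) representations have been resized to the spatial resolution of the raw input:

```python
import torch

def progressive_input(x_m: torch.Tensor,
                      h_s_prev: torch.Tensor | None = None,
                      h_p_prev: torch.Tensor | None = None) -> torch.Tensor:
    """Assemble the round-i extractor input from the raw data x_m and the
    SE-weighted shared/specific representations of round i-1. Assumes the
    stored representations match the spatial size of x_m; the extractor's
    first layer must accept the widened channel dimension."""
    if h_s_prev is None:                     # round 1: representations start at 0
        h_s_prev = torch.zeros_like(x_m)
        h_p_prev = torch.zeros_like(x_m)
    return torch.cat([x_m, h_s_prev, h_p_prev], dim=1)
```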

2.1.4. Graph Attention Fusion Representation

Traditional graph neural networks often use fixed weights or rules when aggregating neighbor node information, which hinders the search for an optimal solution. The core idea of Graph Attention Networks (GATs) is to integrate attention mechanisms into graph neural networks, adaptively computing importance weights for each node's neighbors and using them to aggregate neighborhood information [17]. The aggregation effect is illustrated in Figure 2, and the specific representation process is as follows:
$$z_i = W h_i, \quad h_i \in \mathbb{R}^F, \; W \in \mathbb{R}^{F' \times F}$$
$$e_{ij} = \mathrm{LeakyReLU}\big(a^\top [\, z_i \,\|\, z_j \,]\big), \quad a \in \mathbb{R}^{2F'}$$
$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})}$$
$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} z_j\Big)$$
where $h_i$ represents the initial feature vector of node $i$ with dimension $F$; $W$ represents the parameter matrix with dimension $F' \times F$; $z_i$ represents the feature vector of $h_i$ after a linear transformation; $e_{ij}$ represents the attention coefficient of node $i$ for its neighboring node $j$; $a$ is a learnable weight vector; $\|$ denotes the concatenation operation for vectors; LeakyReLU is the activation function; $\mathcal{N}_i$ represents the set of neighboring nodes of node $i$; $\alpha_{ij}$ represents the normalized attention coefficient of node $i$ for its neighboring node $j$; and $h_i'$ represents the feature vector of node $i$ after applying attention weighting.
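A minimal single-head PyTorch implementation of the four equations above is sketched below. Two assumptions not fixed by the text: the output activation σ is taken to be ELU, as in the original GAT formulation [17], and the adjacency matrix is assumed to include self-loops.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer over a dense 0/1 adjacency matrix."""
    def __init__(self, f_in: int, f_out: int):
        super().__init__()
        self.W = nn.Linear(f_in, f_out, bias=False)   # z_i = W h_i
        self.a = nn.Parameter(torch.empty(2 * f_out)) # attention vector a
        nn.init.normal_(self.a, std=0.1)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        z = self.W(h)                                 # (N, F')
        n = z.size(0)
        # pair[i, j] = [z_i || z_j] for every node pair
        pair = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                          z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(pair @ self.a)               # raw attention coefficients
        e = e.masked_fill(adj == 0, float("-inf"))    # restrict to neighbors N_i
        alpha = torch.softmax(e, dim=1)               # normalized coefficients
        return F.elu(alpha @ z)                       # weighted aggregation
```

Masking non-neighbors with -inf before the softmax restricts the normalization to $\mathcal{N}_i$, exactly as in the normalization equation above.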
Rail surface defect monitoring data exhibit strong temporal correlations and dependencies across successive time steps. Emphasizing the relevant information immediately before and after the occurrence of a defect can significantly enhance the accuracy of subsequent defect type diagnosis. This paper converts the multimodal features extracted by the feature extractor into an undirected graph structure and exploits the spatial attention aggregation mechanism of graph attention networks to capture the temporal dependencies and correlations among neighboring multimodal features. The objective is to fully utilize the contextual semantic information embedded in multi-modal data to acquire a more comprehensive understanding of rail surface conditions and to achieve efficient integration of multimodal features. The detailed characterization process is described as follows. First, a vector concatenation operation is applied to the joint representation of image and vibration data features to obtain the concatenated multimodal feature representation h:
$$h = h^p \,\|\, h^v$$
where $h^p$ and $h^v$ represent the joint feature representations of image data and vibration data, respectively, and $\|$ denotes the vector concatenation operation.
Secondly, the concatenated multimodal feature representation $h$ is mapped onto an undirected graph data structure $G$ with 50 nodes and 49 edges connected sequentially. In graph $G$, the feature of the ith node corresponds to the ith row of the concatenated representation $h$. The mapping relationship is illustrated in Figure 3. Specifically, using node 5 in graph $G$ as an example, after two layers of spatial attention aggregation via the graph attention network, the node incorporates information from nodes 3 through 7 (both preceding and succeeding). This process effectively captures the correlations and dependencies between temporally adjacent steps in multimodal features.
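The 50-node, 49-edge chain described above can be built directly. Self-loops are added here so that each GAT layer also retains a node's own features, a common convention that the paper does not state explicitly:

```python
import torch

def chain_adjacency(num_nodes: int = 50) -> torch.Tensor:
    """0/1 adjacency for a chain graph: node i is linked to i-1 and i+1,
    giving 49 undirected edges for 50 nodes, plus self-loops."""
    adj = torch.eye(num_nodes)            # self-loops
    idx = torch.arange(num_nodes - 1)
    adj[idx, idx + 1] = 1.0               # edge (i, i+1)
    adj[idx + 1, idx] = 1.0               # undirected: edge (i+1, i)
    return adj

# Two stacked GAT layers over this graph let node 5 aggregate nodes 3-7,
# matching the example in Figure 3.
A = chain_adjacency(50)
```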
Next, the graph G is passed through the graph attention network to obtain the graph attention fusion representation F:
$$F = \mathrm{GAT}(G; A; \theta_{gat})$$
where GAT is the graph attention network; $A$ is the adjacency list of graph $G$; and $\theta_{gat}$ represents the parameters of the GAT network.
Subsequently, the resulting graph attention fusion representation F is fed into a multilayer perceptron (MLP) for rail surface defect diagnosis:
$$O = \mathrm{Inf}(F; \theta_{Inf})$$
where $O$ is the rail surface defect diagnosis result and $\theta_{Inf}$ represents the parameters of the multilayer perceptron (MLP) network.

2.2. Learning Strategy

To facilitate multimodal data fusion for rail surface defect diagnosis, this paper proposes a novel joint loss function as the training objective, which the model seeks to minimize. The following sections provide a detailed introduction to the designed loss function.

2.2.1. Reconstruction Loss

Reconstruction loss measures the discrepancy between reconstructed samples and the original inputs. Minimizing this loss enables the convolutional autoencoder (CAE) to learn effective features from the data. In this paper, the mean squared error (MSE) loss is selected as the reconstruction loss, and its calculation formula is as follows:
$$L_{\mathrm{recon}}^m = \| x^m - U^m \|_2^2, \quad m \in \{c, v\}$$
where $x^m$ and $U^m$ represent the original input data and the reconstructed data of the mth modality, respectively, and $\|\cdot\|_2^2$ denotes the squared $L_2$ norm.
The reconstruction loss $L_{\mathrm{recon}}$ proposed in this paper is expressed as:
$$L_{\mathrm{recon}} = L_{\mathrm{recon}}^c + L_{\mathrm{recon}}^v$$

2.2.2. Joint Domain Disentangled Representation Loss

The joint domain disentangled representation loss consists of two components: a similarity loss and a dissimilarity loss. The similarity loss aims to constrain the shared feature distributions of the source and target domains to be as similar as possible, thereby capturing modality-shared representations in the multimodal feature space. The dissimilarity loss enforces the domain-specific feature distributions in the multimodal feature space to be as distinct as possible, thereby capturing the modality-specific representations in the multimodal feature subspace. In consideration of the model's stability during training, this study employs the central moment discrepancy (CMD) loss as the similarity loss and utilizes a subspace orthogonality constraint loss as the dissimilarity loss. The similarity loss $L_{\mathrm{sim}}$, dissimilarity loss $L_{\mathrm{diff}}$, and joint domain disentangled representation loss $L_j$ are expressed as follows:
$$L_{\mathrm{sim}} = \mathrm{CMD}(h_s^p, h_s^v)$$
$$L_{\mathrm{diff}} = \big\| (h_s^p)^\top h_p^p \big\|_F^2 + \big\| (h_s^v)^\top h_p^v \big\|_F^2$$
$$L_j = L_{\mathrm{sim}} + L_{\mathrm{diff}}$$
where $\|\cdot\|_F^2$ denotes the squared Frobenius norm. The CMD (central moment discrepancy) loss is expressed as follows:
$$\mathrm{CMD}(p, q) = \sum_{m=1}^{M} w_m \big\| \mu_{p,m} - \mu_{q,m} \big\|^2$$
where $p$ and $q$ indicate the feature distributions of the source and target domains, respectively. The terms $\mu_{p,m}$ and $\mu_{q,m}$ indicate the mth-order central moments of the feature distributions in the source domain and target domain, respectively. The symbol $M$ denotes the highest order of the central moments under consideration, while $w_m$ corresponds to the weight coefficient for the mth-order central moment. The notation $\|\cdot\|^2$ signifies the squared norm.
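A sketch of the two loss terms under two stated assumptions: the moment weights $w_m$ are set to 1 (the paper leaves them unspecified), and each representation is a matrix whose rows are samples.

```python
import torch

def cmd_loss(hp: torch.Tensor, hq: torch.Tensor, n_moments: int = 5) -> torch.Tensor:
    """Central moment discrepancy between two feature batches (rows = samples),
    with all moment weights w_m taken as 1."""
    mp, mq = hp.mean(0), hq.mean(0)
    loss = torch.norm(mp - mq, p=2) ** 2              # first-order (mean) term
    cp, cq = hp - mp, hq - mq
    for m in range(2, n_moments + 1):                 # higher-order central moments
        loss = loss + torch.norm((cp ** m).mean(0) - (cq ** m).mean(0), p=2) ** 2
    return loss

def diff_loss(h_s: torch.Tensor, h_p: torch.Tensor) -> torch.Tensor:
    """Subspace orthogonality constraint: squared Frobenius norm of the
    cross-correlation between shared and specific representations."""
    return torch.norm(h_s.t() @ h_p, p="fro") ** 2
```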

2.2.3. Domain-Adversarial Loss

The domain-adversarial loss consists of the source domain classification loss and the domain discrimination loss. Since both components are classification tasks, this paper adopts the cross-entropy loss for both. Assuming the dataset contains $N$ image samples and $N$ vibration samples, the source domain classification loss $L_c$, the domain discrimination loss $L_d$, and the domain-adversarial loss $L_{adv}$ are expressed as follows:
$$L_c = -\sum_{i=1}^{N} p_i \log P_i$$
$$L_d = -\sum_{i=1}^{2N} d_i \log D_i$$
$$L_{adv} = L_c + L_d$$
where $p_i$ represents the ground-truth label of the image sample; $P_i$ denotes the decision output of the source-domain classifier for the image sample; $d_i$ indicates the ground-truth modality label of the image and vibration data; and $D_i$ corresponds to the modality decision output of the modality discriminator for the image and vibration data.

2.2.4. Task Loss

The rail surface defect diagnosis task is essentially a multi-class classification problem. Therefore, this study employs the cross-entropy loss as the task loss, computed as follows:
$$L_{\mathrm{task}} = -\sum_{i=1}^{N} y_i \log Y_i$$
where $y_i$ represents the ground-truth label of sample $i$, while $Y_i$ denotes the decision output of the classification network for sample $i$.

2.2.5. Total Objective Loss

Based on the components described above, the joint loss function $L_{\mathrm{total}}$ designed in this paper is expressed as follows:
$$L_{\mathrm{total}} = L_{\mathrm{recon}} + L_j + L_{adv} + L_{\mathrm{task}}$$
where the total objective is constructed by first employing the reconstruction loss $L_{\mathrm{recon}}$ to facilitate the model's learning of effective features from the multimodal input data for preliminary feature extraction. Subsequently, the joint domain disentangled representation loss $L_j$ is introduced to capture modality-shared and modality-specific representations within the multimodal feature subspace, thereby bridging the semantic gap between heterogeneous data sources. Thereafter, the domain-adversarial loss $L_{adv}$ is utilized to further discover redundant and complementary features across the multimodal data, fully exploring commonalities and differences among them. Finally, cross-entropy loss is introduced as the task loss $L_{\mathrm{task}}$ to evaluate the model's performance in rail surface defect diagnosis. This integrated design ensures end-to-end optimization of the model, balancing feature learning, domain adaptation, and classification accuracy.
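Putting the pieces together, a training step might look as follows. The PJRGAFNStub class is a hypothetical stand-in for the full network (which would bundle the CAEs, SE blocks, GRL, discriminator, GAT, and MLP sketched above), and the optimizer settings anticipate those reported in Section 3.2 (Adam, learning rate 0.5 × 10⁻³, 600 epochs, batch size 50):

```python
import torch
import torch.nn as nn

class PJRGAFNStub(nn.Module):
    """Hypothetical stand-in returning the four loss terms for a batch."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(8, 4)   # placeholder for the full architecture

    def forward(self, x_p, x_v, y):
        logits = self.net(torch.cat([x_p, x_v], dim=1))
        l_task = nn.functional.cross_entropy(logits, y)
        zero = torch.tensor(0.0)     # recon/joint/adversarial terms elided here
        return zero, zero, zero, l_task

model = PJRGAFNStub()
optimizer = torch.optim.Adam(model.parameters(), lr=0.5e-3)

# One synthetic batch of 50 paired image/vibration samples with 4 classes.
x_p, x_v = torch.randn(50, 4), torch.randn(50, 4)
y = torch.randint(0, 4, (50,))

for epoch in range(600):                      # 600 epochs (Sec. 3.2)
    l_recon, l_j, l_adv, l_task = model(x_p, x_v, y)
    loss = l_recon + l_j + l_adv + l_task     # equal-weight sum of the four terms
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```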

3. Experimental Analysis

3.1. Dataset Description

To evaluate the effectiveness of the proposed PJR-GAFN, this study utilizes a real-world rail surface monitoring dataset collected from a specific railway line. The dataset encompasses four distinct rail surface maintenance conditions: normal, top surface abrasion, flaking and spalling, and wave-like wear, as illustrated in Figure 4. Notably, considering the operational speed of the track inspection vehicle and the sampling frequency of the acceleration sensors, mileage information was used to align multimodal data. Specifically, one inspection image is paired with a segment of 1000 vibration signal data points to form a multimodal rail surface monitoring data pair (as illustrated in Figure 5). The final dataset comprises two types of aligned monitoring data: image data and vibration data. It includes 800 rail surface inspection images (200 samples for each maintenance condition) and 800 corresponding vibration monitoring segments (200 segments for each maintenance condition).
Given the generally stable lighting conditions and minimal interference noise in the images—along with occasional transient impact noise in the vibration signals caused by collisions between the train and track joints—standard preprocessing was applied to both the images and vibration signals. These steps included image resizing, data normalization, and signal normalization. Finally, the dataset was divided into training and test sets in a 3:1 ratio.

3.2. Experimental Details

The experiments were conducted in the PyTorch deep learning framework with the following environment configuration: processor, AMD (Santa Clara, CA, USA) Ryzen 9-7940HS, 4.00 GHz; GPU, NVIDIA (Santa Clara, CA, USA) GeForce RTX 3080 Laptop GPU, 12.0 GB; memory, 16.0 GB; code runtime environment, PyTorch 1.12.0, Python 3.10.10. During iterative model training, the Adam optimizer was used for 600 epochs with a batch size of 50 and a learning rate of $0.5 \times 10^{-3}$.
The architecture and parameter configuration of the proposed PJR-GAFN are illustrated in Figure 6. It is important to note that, prior to feeding the vibration data into the convolutional autoencoder, a dimensionality expansion operation is performed, converting the one-dimensional vibration data into a two-dimensional format to facilitate subsequent processing.

3.3. Comparative Experiments

To verify the effectiveness of PJR-GAFN, six comparative models were implemented in this study. These included a Fast R-CNN-based rail surface defect diagnosis model for single-modality (image) data proposed in [18], an MRC-CSN-based rail surface defect diagnosis model proposed in [19], an ISAE-LDA-SVC-based rail surface defect diagnosis model for multimodal (image and vibration) data proposed in [15], an MFDF-Net-based rail surface defect diagnosis model proposed in [5], a CNN-LSTM-SW-based rail surface defect diagnosis model proposed in [20], and an ECARRNet-based rail surface defect diagnosis model proposed in [21]. To mitigate the impact of training randomness, the average of 10 independent experimental runs was adopted as the final evaluation metric. The comparative experimental results are shown in Table 1.
From Table 1, it can be observed that the diagnostic accuracy rates of the five multimodal fusion models significantly exceed the best accuracy achieved by the single-modality approaches, validating the effectiveness of multimodal data fusion for rail surface defect diagnosis tasks. The proposed PJR-GAFN attains a diagnostic accuracy of 95%, surpassing the reference models. This outcome indicates that PJR-GAFN, by integrating the joint domain disentangled representation module with domain-adversarial training for multimodal feature extraction and incorporating a progressive fusion module, effectively exploits latent cross-modal synergistic information, thereby bridging the semantic gap among heterogeneous multi-source data. Moreover, harnessing the spatial attention aggregation property of graph attention networks to capture correlations and dependencies between temporally adjacent multimodal features proves highly effective for rail surface defect diagnosis tasks.
In addition to diagnostic accuracy, this paper introduces the Receiver Operating Characteristic (ROC) curve to further evaluate the performance of the five multimodal fusion models. The vertical axis of the ROC curve denotes the True Positive Rate (TPR), while the horizontal axis represents the False Positive Rate (FPR). The area enclosed by the ROC curve and the coordinate axes is termed the Area Under the Curve (AUC), where a higher AUC value indicates superior model classification performance. The ROC curves for the five multimodal fusion models are shown in Figure 7.
In Figure 4, Figure 5, Figure 6 and Figure 7, C0, C1, C2, and C3 represent samples with normal rail surfaces, top-surface abrasion, spalling, and corrugation, respectively. The micro-average and macro-average metrics collectively reflect the model's overall diagnostic performance across the entire sample set. An analysis of these micro-average and macro-average curves shows that the proposed PJR-GAFN model achieves the highest AUC value, demonstrating its superior performance in rail surface defect diagnosis.

3.4. Ablation Study

To evaluate the effectiveness of each module in the proposed PJR-GAFN, we conduct quantitative ablation studies from two perspectives: network architecture and loss function. Five experiments are included: (1) rail surface defect diagnosis without the squeeze-and-excitation (SE) module; (2) rail surface defect diagnosis without the progressive fusion module; (3) rail surface defect diagnosis without the graph attention fusion module; (4) rail surface defect diagnosis without the domain-adversarial loss L adv ; and (5) rail surface defect diagnosis without the joint domain disentangled representation loss L j . The classification accuracy of each ablation variant is listed in Table 2.
From Table 2, it can be observed that the omission of either the squeeze-and-excitation (SE) module or the progressive fusion module results in a 1.5% reduction in rail surface defect diagnosis accuracy. This finding demonstrates that adaptively amplifying defect-related features while suppressing irrelevant features through the SE module is effective and that providing early cross-modal information guidance via the progressive fusion module also benefits the task. Furthermore, the absence of the graph attention fusion module leads to a 10% drop in diagnostic accuracy, highlighting the significant role of capturing correlations and dependencies between temporally adjacent multimodal features through the spatial attention aggregation property of graph attention networks. Finally, the removal of the domain-adversarial loss L adv or the joint domain disentangled representation loss L j reduces accuracy by 2% and 3%, respectively, confirming that constraining the distribution of multimodal features in the feature subspace through L adv and L j is critical for enhancing model performance.

3.5. Model Explainability Analysis

To further validate the efficacy of the proposed PJR-GAFN, this section employs the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm to perform dimensionality reduction in order to visualize the outputs from the model’s joint domain disentangled representation layer, SE block layer, GAT fusion layer, and decision layer. The visualization results are presented in Figure 8.
In Figure 8, the four distinct point shapes (circles, pentagrams, triangles, and squares) denote the four different rail surface maintenance conditions: normal rail surfaces, top surface abrasion, spalling, and corrugation. Examination of the output from the joint domain disentangled representation layer reveals that the feature distributions for the four rail surface states are relatively concentrated, indicating that this layer effectively bridges the semantic gap among multi-source heterogeneous data. Observation of the SE block layer output shows that the feature distributions become even more clustered compared to those from the joint domain disentangled representation layer, demonstrating that the channel attention weighting mechanism in the SE block effectively constrains the distribution of rail surface features. Moreover, analysis of the GAT fusion layer output indicates that spalling samples are well-separated, whereas the other rail surface states exhibit minimal overlap, thereby confirming the efficacy of capturing correlations and dependencies between temporally adjacent multimodal features via the spatial attention aggregation property of GAT. Finally, inspection of the decision layer output reveals that while normal rail surface, top surface abrasion, and corrugation samples exhibit slight overlap, spalling samples are accurately classified, thereby validating the efficacy of the proposed model.
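A visualization of this kind can be reproduced with scikit-learn's t-SNE; the feature array and labels below are random placeholders standing in for the actual layer outputs and the four condition classes.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholders: 200 feature vectors of dimension 128 with four class labels.
feats = np.random.randn(200, 128)
labels = np.random.randint(0, 4, size=200)

# Reduce layer outputs to 2D for visual inspection of class clustering.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)
for c, marker in zip(range(4), ["o", "*", "^", "s"]):  # circle, star, triangle, square
    pts = emb[labels == c]
    plt.scatter(pts[:, 0], pts[:, 1], marker=marker, label=f"C{c}")
plt.legend()
plt.show()
```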

3.6. Generalization Experiment

To evaluate the generalization capability of the proposed PJR-GAFN, this study employs the publicly available bearing damage current and vibration dataset from the University of Paderborn, Germany [22], for model testing. This dataset includes four categories of bearing faults: normal bearings, inner race faults, outer race faults, and combined inner and outer race faults. In this experiment, one set of data from intact bearings and four sets of real-world data from faulty bearings were selected under operating conditions of rotational speed N = 1500 r/min, load torque T = 0.7 N·m, and radial force F = 1000 N. Detailed parameter information is provided in Table 3. During data preprocessing, a total of 160,000 data points were extracted from the five datasets and segmented into 400 non-overlapping samples of 400 points each. The dataset was then divided into training and test sets at a ratio of 3:1. The experimental results are summarized in Table 4.
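The segmentation described here (160,000 points cut into 400 non-overlapping windows of 400 points each, then split 3:1) is straightforward to reproduce; the random array below is a placeholder for the concatenated Paderborn signals.

```python
import numpy as np

def segment(signal: np.ndarray, win: int = 400) -> np.ndarray:
    """Split a 1-D signal into non-overlapping windows of `win` points."""
    n = len(signal) // win
    return signal[: n * win].reshape(n, win)

x = np.random.randn(160_000)        # placeholder for the concatenated signals
samples = segment(x)                # shape (400, 400)
train, test = samples[:300], samples[300:]   # 3:1 train/test split
```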
As shown in Table 4, the proposed PJR-GAFN achieves a bearing fault diagnosis accuracy of 99.8% on the publicly available bearing damage current and vibration dataset from the University of Paderborn, Germany, significantly outperforming other multimodal fusion models. This result demonstrates the proposed model’s strong multimodal data fusion capability and excellent generalization performance.
In addition to conventional diagnostic accuracy metrics, this section further evaluates the model’s generalization ability by analyzing the ROC curves of the multimodal fusion models listed in Table 4. The ROC curves for the five models are presented in Figure 9.
In Figure 9, C0, C1, C2, and C3 correspond to samples of normal bearings, inner race faults, outer race faults, and combined inner and outer race faults, respectively. The micro-average and macro-average represent the microscopic and macroscopic performance across all classes, reflecting the overall diagnostic capability of the model. As illustrated in the figure, the proposed PJR-GAFN achieves the highest AUC, indicating superior diagnostic performance and further confirming its strong generalization capability.

4. Conclusions

This paper proposes a Progressive Joint Representation Graph Attention Fusion Network (PJR-GAFN) for rail surface defect diagnosis. The network comprises two core components: the Progressive Joint Domain Separation Representation Module and the Graph Attention Fusion Inference Module. The Progressive Joint Domain Separation Representation Module first extracts both shared and modality-specific representations from each modality using a shared autoencoder and a modality-specific autoencoder to construct a joint feature representation. A squeeze-and-excitation (SE) block is then applied to further amplify defect-relevant features while suppressing irrelevant ones. A progressive fusion strategy is subsequently employed to ensure full exploitation of latent synergistic information across modalities during the feature extraction process. In addition, a source domain classifier and a domain discriminator are incorporated to capture modality-invariant features from different modalities. The Graph Attention Fusion Inference Module leverages the spatial attention aggregation capability of the Graph Attention Network (GAT) to fuse image and vibration features, fully capturing contextual semantic information from multiple modalities for a more comprehensive representation of rail surface conditions. A multilayer perceptron (MLP) is then used to make the final defect classification. To validate the effectiveness of the proposed PJR-GAFN, a series of comparative, ablation, and generalization experiments were conducted. It is worth noting that the rail surface monitoring data used in this study were manually curated to remove segments with obvious noise, resulting in a dataset of sufficient size and relatively high quality. However, in real-world railway scenarios, defects such as fish-scale cracks, transverse cracks, and longitudinal cracks are generally much less frequent than common defects such as spalling, block shedding, and abrasion. Therefore, performing defect diagnosis based on multimodal data fusion under the conditions of a small sample and noisy data remains a significant challenge. In future work, we aim to explore this issue further.

Author Contributions

Conceptualization, Z.W.; funding acquisition, C.Z.; methodology, Z.W., S.P., and W.A.; software, S.P. and W.A.; validation, S.P. and W.A.; writing—original draft, S.P. and W.A.; writing—review and editing, J.L. and C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Key R&D Program of China (2021YFF0501101), the National Natural Science Foundation of China (Youth Fund Project, Grant 62106074), the National Natural Science Foundation of China (52272347), and the National Science Fund of Hunan (2024JJ7132).

Data Availability Statement

The data presented in this study are available on request from the corresponding author, as the data in this study come from a national key project.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fenling, F.; Ailan, L.; Qiying, H.; Yinchen, Z. Strategy analysis on railway insured transportation and freight insurance cooperative development based on Hotelling model. J. Railw. Sci. Eng. 2023, 20, 356–363.
2. Xu, Y.; Liu, J.; Li, X.; Tang, C. An Investigation of Vibrations of a Flexible Rotor System with the Unbalanced Force and Time-Varying Bearing Force. Chin. J. Mech. Eng. 2025, 38, 25.
3. Binder, M.; Mezhuyev, V.; Tschandl, M. Predictive Maintenance for Railway Domain: A Systematic Literature Review. IEEE Eng. Manag. Rev. 2023, 51, 120–140.
4. Li, Z.; Bai, Q.; Wang, F.; Liu, R. Real-Time Detection System of Rail Surface Defects Based on Semantic Segmentation. Comput. Eng. Appl. 2021, 57, 248–256.
5. Shen, Y.; Zhong, Q.; Zheng, S.; Li, L.; Peng, L. A Multi-Modal Approach to Rail Surface Condition Analysis: The MFDF-Net. IEEE Access 2024, 12, 132480–132494.
6. Chen, Z.; Wang, Q.; He, Q.; Yu, T.; Zhang, M.; Wang, P. CUFuse: Camera and Ultrasound Data Fusion for Rail Defect Detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21971–21983.
7. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
8. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117.
9. Ge, X.; Cao, Z.; Qin, Y.; Gao, Y.; Lian, L.; Bai, J.; Yu, H. An Anomaly Detection Method for Railway Track Using Semisupervised Learning and Vision-Lidar Decision Fusion. IEEE Trans. Instrum. Meas. 2024, 73, 1–15.
10. Yang, J.; Zhou, W.; Wu, R.; Fang, M. CSANet: Contour and Semantic Feature Alignment Fusion Network for Rail Surface Defect Detection. IEEE Signal Process. Lett. 2023, 30, 972–976.
11. Wang, Q.; Wang, X.; He, Q.; Huang, J.; Huang, H.; Wang, P.; Yu, T.; Zhang, M. 3D tensor-based point cloud and image fusion for robust detection and measurement of rail surface defects. Autom. Constr. 2024, 161, 105342.
12. Piao, Y.; Ji, W.; Li, J.; Zhang, M.; Lu, H. Depth-Induced Multi-Scale Recurrent Attention Network for Saliency Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
13. Zhao, H.; Zhao, J.; Zhao, X.; Wang, S.; Li, Y. Rail Surface Defect Method Based on Bimodal-Modal Deep Learning. Comput. Eng. Appl. 2023, 59, 285–293.
14. Liu, N.; Zhang, N.; Han, J. Learning Selective Self-Mutual Attention for RGB-D Saliency Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
15. Yang, T.; Xu, T.; Cheng, Y.; Tang, Z.; Su, S.; Cao, Y. A fusion method based on 1D vibration signals and 2D images for detection of railway surface defects. In Proceedings of the 2023 3rd International Conference on Neural Networks, Information and Communication Engineering (NNICE), Guangzhou, China, 24–26 February 2023; pp. 282–286.
16. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
17. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
18. Choi, J.Y.; Han, J.M. Deep Learning (Fast R-CNN)-Based Evaluation of Rail Surface Defects. Appl. Sci. 2024, 14, 1874.
19. Wang, M.; Zhou, Y. Autonomous Rail Surface Defect Identification Based on an Improved One-Stage Object Detection Algorithm. J. Perform. Constr. Facil. 2024, 38, 04024041.
20. Rahman, M.A.; Jamal, S.; Taheri, H. Remote condition monitoring of rail tracks using distributed acoustic sensing (DAS): A deep CNN-LSTM-SW based model. Green Energy Intell. Transp. 2024, 3, 100178.
21. Eunus, S.I.; Hossain, S.; Ridwan, A.E.M.; Adnan, A.; Islam, M.S.; Karim, D.Z.; Alam, G.R.; Uddin, J. ECARRNet: An Efficient LSTM-Based Ensembled Deep Neural Network Architecture for Railway Fault Detection. AI 2024, 5, 482–503.
22. Lessmeier, C.; Kimotho, J.K.; Zimmer, D.; Sextro, W. Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set for data-driven classification. In Proceedings of the PHM Society European Conference, Chengdu, China, 19–21 October 2016; Volume 3.
Figure 1. Structure of PJR-GAFN.
Figure 2. Schematic chart of GAT information aggregation.
Figure 3. Schematic chart of graph data structure mapping.
Figure 4. Four different rail surface maintenance states.
Figure 5. The correspondence between the image and vibration modalities.
Figure 6. Structure and parameters of PJR-GAFN.
Figure 7. Visualization of ROC curves.
Figure 8. t-SNE dimensionality reduction visualization.
Figure 9. ROC curves of generalization experiment.
Table 1. Comparative experiment.

| Model Type | Data Type | Method | Accuracy/% |
|---|---|---|---|
| Single-Modality Non-Fusion | Image | Fast R-CNN [18] | 83.0 |
| Single-Modality Non-Fusion | Image | MRC-CSN [19] | 83.5 |
| Multimodal Fusion | Image, Vibration | ISAE-LDA-SVC [15] | 93.5 |
| Multimodal Fusion | Image, Vibration | MFDF-Net [5] | 91.5 |
| Multimodal Fusion | Image, Vibration | CNN-LSTM-SW [20] | 91.0 |
| Multimodal Fusion | Image, Vibration | ECARRNet [21] | 94.0 |
| Multimodal Fusion | Image, Vibration | PJR-GAFN | 95.0 |
Table 2. Ablation experiment.

| Experiment Type | Ablated Component | Accuracy (%) |
|---|---|---|
| Network Structure Ablation | Missing Squeeze-and-Excitation (SE) Module | 93.5 |
| Network Structure Ablation | Missing Progressive Fusion Module | 93.5 |
| Network Structure Ablation | Missing Graph Attention Fusion Module | 85.0 |
| Loss Function Ablation | Missing Domain-Adversarial Loss $L_{adv}$ | 93.0 |
| Loss Function Ablation | Missing Joint Domain Disentangled Representation Loss $L_j$ | 92.0 |
| Baseline Model | None | 95.0 |
Table 3. Data parameters for generalization experiments.

| Serial Number | Type | Position | Form | Degree |
|---|---|---|---|---|
| K001 | Normal | – | – | – |
| KA04 | Bearing Fault | Outer Ring | Single Point | 1 |
| KA15 | Plastic Deformation | Outer Ring | Single Point | 1 |
| KB23 | Bearing Fault | Inner and Outer Rings | Multiple Points | 2 |
| KI21 | Bearing Fault | Inner Ring | Single Point | 1 |
Table 4. Generalization experiment.

| Model Type | Data Type | Method | Accuracy (%) |
|---|---|---|---|
| Multimodal Fusion | Electrical, Vibration | ISAE-LDA-SVC | 97.0 |
| Multimodal Fusion | Electrical, Vibration | MFDF-Net | 98.0 |
| Multimodal Fusion | Electrical, Vibration | CNN-LSTM-SW | 96.8 |
| Multimodal Fusion | Electrical, Vibration | ECARRNet | 98.2 |
| Multimodal Fusion | Electrical, Vibration | PJR-GAFN | 99.8 |
