Article

Railway Signal Relay Voiceprint Fault Diagnosis Method Based on Swin-Transformer and Fusion of Gaussian-Laplacian Pyramid

1 School of Automation and Intelligence, Beijing Jiaotong University, Beijing 100044, China
2 Railway Science & Technology Research & Development Center, China Academy of Railway Sciences Corporation Limited, Beijing 100000, China
3 Hefei Signaling and Telecommunications Section of China Railway Shanghai Group Co., Ltd., Hefei 230000, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(23), 3846; https://doi.org/10.3390/math13233846
Submission received: 20 October 2025 / Revised: 19 November 2025 / Accepted: 21 November 2025 / Published: 1 December 2025

Abstract

Fault diagnosis of railway signal relays is crucial for the operational safety and efficiency of railway systems. With the continuous advancement of deep learning techniques in various applications, voiceprint-based fault diagnosis has emerged as a research hotspot, facilitating the transition from failure-based repair to condition-based maintenance. However, this approach still faces challenges such as the limited feature extraction capability of single voiceprint features and poor discriminability when features are highly concentrated. To address these issues, this paper proposes a voiceprint-based fault diagnosis method for railway signal relays that utilizes a Gaussian–Laplacian pyramid fusion rule and an improved Swin Transformer. The enhanced Swin Transformer integrates the original architecture with a saliency feature map as a masking strategy. Experimental results demonstrate that the proposed method, based on the Gaussian–Laplacian pyramid fusion rule and the improved Swin Transformer, reduces the number of parameters by 54.8% compared to the Vision Transformer while maintaining nearly the same accuracy.

1. Introduction

Railway signaling systems utilize a large number of signal relays, which perform crucial functions such as signal transmission and interlocking control; the proper functioning of these relays significantly impacts the safe and stable operation of trains. Although technological advancements and improved quality management have significantly reduced the probability of modern railway signal relay failures, a single malfunction leading to incorrect relay operation or failure can trigger a chain reaction, affecting the normal operation of the entire signaling system. Even a minor fault in a single relay can have serious operational consequences. Furthermore, the advancement of next-generation train control systems such as ATCS [1,2] from research to application also relies heavily on the reliability provided by relays. Therefore, ensuring the safe and reliable operation of railway signal relays is of paramount importance.
In railway signaling systems, relay safety assurance technologies can be broadly divided into two key areas: real-time monitoring and periodic maintenance. For real-time monitoring, a microcomputer-based monitoring system is currently widely used. This system collects real-time data on relay contact status and compares it with control commands to verify drive consistency. This monitoring method can promptly detect significant faults during relay operation and automatically generate alarms, providing a basis for troubleshooting. However, existing monitoring technologies struggle to effectively detect the mechanical condition of relays, such as contact sticking and dust accumulation. Furthermore, because microcomputer-based monitoring is an indirect method, it can typically only identify abnormal signals, making it difficult to precisely locate and trace the cause of a fault. Currently, railway signal relays generally follow a fixed-cycle maintenance model, where relays are periodically disassembled in batches and sent to specialized repair shops for cleaning, adjustment, and testing. Although this model has guaranteed the reliability of the equipment to a certain extent, it also has the following obvious disadvantages: (1) a fixed-cycle maintenance schedule rarely matches the actual degradation process of a relay, which can easily lead to over-maintenance or under-maintenance; (2) large-scale disassembly and maintenance not only consumes considerable manpower and materials, but also significantly increases maintenance costs; (3) because of limited maintenance capacity, detection coverage may be incomplete, leaving maintenance blind spots that pose safety hazards. In addition, frequent disassembly and installation may introduce new risks of mechanical damage. It is worth noting that the sound signals generated by a relay during operation contain rich working-status information and can directly reflect the health of its mechanical structure. Voiceprint monitoring, as a non-contact detection method, can significantly improve the condition monitoring and maintenance efficiency of railway signal relays, enabling more accurate and economical predictive maintenance. The four most common faults in railway signal relays are contact deformation, contact oxidation, foreign matter remaining on the contacts, and coil short circuit. The specific fault locations are shown in Figure 1.
Researchers have long studied this non-contact voiceprint fault diagnosis technology, and its main application areas are bearing fault diagnosis [3], speech recognition [4,5,6], and transformer fault diagnosis [7,8,9,10]. Given these advantages of voiceprint detection technology, this paper studies voiceprint-based fault diagnosis for railway signal relays.

1.1. Related Work

In most studies, equipment faults are diagnosed from voiceprints in two steps: feature extraction and model classification. Applying feature extraction to voiceprint fault recognition can improve recognition accuracy and thereby greatly improve equipment reliability. P. Mao et al. [11] proposed a voiceprint recognition model for hydropower units based on the Mel-spectrum and a convolutional neural network (CNN), in which the Mel-spectrum performs the feature extraction of the hydropower-unit voiceprints. R. Mushi et al. [12] used a deep convolutional neural network to evaluate Mel filter bank features for sound classification: after suppressing audio noise with a pre-emphasis filter, audio data were generated by random sampling and converted into Mel filter bank features to obtain feature vectors. S. E. Shia et al. [13] developed an efficient speech disorder detection system using the wavelet transform and a feedforward neural network; normal and abnormal speech from the SVD speech database were decomposed by a one-dimensional discrete wavelet transform, and the energy of the wavelet subband coefficients was calculated to form speech feature vectors. L. Xiong et al. [14] used deep learning networks to achieve text-independent voiceprint recognition, extracting Mel frequency cepstral coefficients (MFCC) and their first-order differential components (ΔMFCC) and combining them into feature vectors; through principal component analysis (PCA) and data normalization, the extracted voiceprint feature matrix was compressed and input into long short-term memory (LSTM) and bidirectional LSTM network models for voiceprint target classification.
In summary, Table 1 summarizes the main feature extraction methods cited above, along with their main advantages and limitations.
Image fusion can exploit the respective strengths of the two voiceprint feature extraction methods and further improve the accuracy of fault diagnosis. S. Parisotto et al. [15] proposed a new nonlinear variational model for image fusion that achieves visually plausible fusion by minimizing a nonconvex energy built on a penetration energy term, is invariant to multiplicative brightness changes, requires minimal supervision and parameter tuning, and can encode prior information about the structure of the images to be fused. O. Prakash et al. [16] proposed a pixel-level image fusion scheme based on the multi-resolution biorthogonal wavelet transform, in which the wavelet coefficients at different decomposition levels are fused using an absolute-maximum fusion rule; the scheme was evaluated on images both without noise and with additive Gaussian white noise, and fusion quality is improved by reducing the loss of important information available in a single image. L. Mingjing et al. [17] proposed a contrast pyramid image fusion rule based on local energy and image gradient, and verified the feasibility of the proposed fusion rule.
Fault classification models play a decisive role in voiceprint fault identification. E. K. Gulsoy et al. [18] studied two datasets widely used in remote sensing and evaluated the classification performance of the Swin Transformer model against popular models in the literature; after applying 5-fold cross-validation, the Swin Transformer achieved an accuracy of 95.39%, better than the classic CNN, Ensemble, and GAN models. X. Zhang et al. [19] proposed a novel dual-channel deep learning architecture based on Swin Transformer V2 that integrates a one-dimensional Transformer with two-dimensional wavelet time-frequency maps. It leverages the one-dimensional Transformer network to process the original vibration signal and effectively capture the inherent time dependence and potential fault modes of the mechanical system, while the continuous wavelet transform converts the time-domain signal into a two-dimensional time-frequency map that Swin Transformer V2 then analyzes to extract image features. This dual-channel framework integrates information from time series and frequency-domain images and significantly enhances fault detection performance in mechanical systems. H. Dou et al. [20] proposed a dual-channel fault diagnosis fusion model that integrates the Swin Transformer and ResNet architectures. The model generates the time-frequency representation of the fault signal through the continuous wavelet transform (CWT) and uses variational mode decomposition (VMD) to decompose the original one-dimensional vibration signal into intrinsic mode functions (IMFs), extracting statistical features to form feature vectors. These processed vectors are then concatenated with the original vibration signal to construct one-dimensional feature samples. Finally, the one-dimensional feature samples and wavelet time-frequency images are input into the ResNet and Swin Transformer models, respectively, for training, and the outputs of the two models are fused at the decision level, achieving excellent diagnostic performance.
Due to the special nature of the sound of relay action, it is also very important to improve the model’s attention to input features. F. Fan et al. [21] proposed an improved and efficient Faster R-CNN algorithm for PCB solder joint defects and component detection. This study shows that convolutional layers can effectively complete the construction of saliency maps because the convolution operation itself has the ability to extract local features and model spatial hierarchical structures. Through the sliding scan of multi-level convolutional kernels, it can gradually capture diverse features from low-level texture to high-level semantics, and on this basis generate a saliency distribution that reflects the key regions in the image.
Research on equipment fault diagnosis based on voiceprints generally follows a technical approach that combines feature extraction with model classification. In terms of feature extraction, acoustic features such as the Mel spectrum, Mel filter bank, wavelet transform, and Mel frequency cepstral coefficients have been widely used to characterize equipment status. By combining these features with models such as convolutional neural networks, feedforward networks, and long short-term memory networks, the accuracy of fault identification has been effectively improved. To further enhance feature representation capabilities, multi-source information fusion methods have received widespread attention. Among them, image fusion techniques based on variational models, the biorthogonal wavelet transform, and contrast pyramids achieve synergistic enhancement of multimodal features while preserving key information. Regarding classification models, the Swin Transformer, with its superior modeling capabilities, has demonstrated performance superior to traditional convolutional networks and ensemble methods in multiple diagnostic tasks. Its derived dual-channel architectures and fusion models with ResNet further expand its potential in joint time-frequency analysis. Furthermore, considering the significant transient features in relay voiceprints, convolution-based saliency detection methods have also been proven to effectively guide the model to focus on key fault regions.
The literature review reveals a clear direction for the development of voiceprint-based relay fault diagnosis technology: voiceprint analysis is shifting from the time domain to time-frequency spectrograms, feature extraction methods are shifting from single to multiple types, feature fusion methods are shifting from simple splicing to deep fusion, and fault diagnosis models are shifting from general to task-specific designs.

1.2. Research Results

Building on existing research, this paper proposes a feature extraction method based on the fusion of MFCC and CWT. MFCC offers excellent performance in frequency-domain representation, while CWT provides high-resolution signal representation in the time-frequency domain. The complementary fusion of the two can effectively improve feature discriminability, achieving a “1 + 1 > 2” feature enhancement effect. To this end, we further design a fusion rule based on the Gaussian-Laplacian pyramid to integrate the advantages of both feature extraction methods.
After obtaining the fused feature image, the Swin Transformer is adopted as the backbone network due to its excellent classification performance and lower computational complexity compared to models such as ViT and LSTM. To further enhance the model’s ability to perceive key features, a saliency guidance mechanism is introduced. This module highlights salient fault-related features while suppressing interference from irrelevant background information, thereby improving the model’s discriminative capability and robustness.

1.3. Paper Structure

To systematically elucidate the proposed railway signal relay voiceprint fault diagnosis method based on the Swin-Transformer and Gaussian-Laplacian pyramid fusion, this paper is organized as follows: First, the introduction clarifies the research background of relay fault diagnosis, existing voiceprint diagnosis technologies, existing voiceprint feature extraction methods, and image fusion methods, defining the core objective of this research. Then, the second section details the feature fusion rules of MFCC and CWT based on the Gaussian-Laplacian pyramid. Next, the third section explains the design principle of the I-Swin model and its prototype, the Swin-Transformer model. Following this, the fourth section introduces the dataset, experimental equipment setup, and experimental parameter settings, and verifies the effectiveness and superiority of the proposed method through ablation and comparative experiments. Finally, the fifth section summarizes the research work and discusses the value of the method and future improvement directions.

2. Feature Extraction and Fusion

In the field of voiceprint feature extraction, MFCC effectively extracts key features from acoustic signals by simulating the auditory characteristics of the human ear, while CWT excels at capturing transient details in non-stationary signals. Existing fusion methods are mostly limited to simple feature splicing or decision-level fusion, failing to achieve deep feature interaction at multiple resolution levels. This study adopts a Gaussian-Laplacian pyramid framework and, by injecting the auditory attributes of MFCC and the time-frequency details of CWT in parallel at each level of the pyramid, achieves structured fusion of the two features in a unified scale space.

2.1. MFCC

The MFCC feature extraction process is similar to the Mel-spectrogram based on the Mel-scale filter bank, but it further processes the Mel-spectrogram. The procedure includes five steps: framing, windowing, FFT, Mel filtering, and DCT.

2.1.1. Framing

The core theoretical basis of the MFCC algorithm is the assumption of “short-term stationarity”: although the transient signal fluctuates violently overall, framing and windowing allow each analysis window to be treated as approximately stationary. Assuming that the number of relay action voiceprint samples is N, the frame length is L, and the frame shift is S, the number of frames M is calculated as:
$M = \dfrac{N - L}{S} + 1,$

2.1.2. Windowing

To suppress spectral leakage, this paper employs a Hamming window, which has stronger sidelobe attenuation capabilities, to window each frame of the signal. Its expression is:
$w[n] = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{L-1}\right), & 0 \le n \le L-1 \\ 0, & \text{otherwise} \end{cases}$
$o_t[n] = y_t[n] \times w[n],$
where t represents the sequence number of the frame signal, t = 1, 2, …, M; y_t[n] represents the t-th frame of the original signal; w[n] represents the Hamming window function; and o_t[n] represents the t-th frame signal after windowing.
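As a concrete illustration of the framing and windowing steps, the following minimal NumPy sketch frames a signal and applies a Hamming window; the sampling rate, frame length, hop size, and the synthetic test signal are illustrative assumptions rather than the settings actually used in this paper.

```python
import numpy as np

def frame_and_window(signal, frame_len, hop):
    """Split a 1-D voiceprint signal into overlapping frames and apply a Hamming window.

    Under the short-term stationarity assumption, the frame count is M = (N - L) / S + 1
    (integer division here, so incomplete trailing samples are dropped).
    """
    n_frames = (len(signal) - frame_len) // hop + 1
    window = np.hamming(frame_len)  # w[n] = 0.54 - 0.46*cos(2*pi*n/(L-1))
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (M, L)

# Hypothetical example: 1 s of audio at 8 kHz, 25 ms frames, 10 ms hop
x = np.random.randn(8000)
frames = frame_and_window(x, frame_len=200, hop=80)
print(frames.shape)  # (98, 200)
```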

2.1.3. FFT

Perform an FFT on each frame of the signal to obtain its spectrum, and arrange the spectrum vectors to obtain the FFT matrix of the voiceprint sample:
$X_t[k] = \sum_{n=0}^{L-1} o_t[n]\, e^{-j\frac{2\pi}{L}kn},$
In the formula, t represents the sequence number of the frame signal, t = 1, 2, …, M; o_t[n] represents the t-th frame signal after windowing; k represents the frequency bin, with only the positive half of the band retained, k = 0, 1, …, L/2.
After obtaining the FFT matrix, the power matrix P_t[k] is obtained by taking the squared modulus and averaging over the frame length:
$P_t[k] = \dfrac{\left| X_t[k] \right|^2}{L},$

2.1.4. Mel Filter

The Mel filter bank maps the spectrum to the Mel scale; the mathematical expression for this transformation is:
$f_{mel} = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right) = \mathrm{Mel}(f),$
where f m e l is the Mel frequency and f is the actual frequency.
A Mel filter bank is typically composed of M Mel filters. Assume that the center frequency, start frequency, and cutoff frequency of the Mel filters are f ( m ) , f ( m 1 ) , and f ( m + 1 ) , respectively. In the Mel frequency domain, the center frequency of each filter is equally spaced. The formula for calculating the center frequency is:
$f(m) = \dfrac{L}{f_s}\,\mathrm{Mel}^{-1}\!\left(\mathrm{Mel}(f_l) + m\,\dfrac{\mathrm{Mel}(f_h) - \mathrm{Mel}(f_l)}{M + 1}\right),$
where f s is the audio sampling rate, f l and f h represent the lowest and highest frequencies of the audio respectively, and the value of f l is 0.
The value of f s is set according to Shannon’s sampling theorem, that is, it satisfies:
$2 f_h \le f_s,$
Therefore, the value of f h is set to half of f s .
The construction formula of Mel filter bank is:
$H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases}$
Finally, the Mel filter bank is applied to each frame of the power matrix and the logarithm is taken to obtain the output of the Mel filter.
$W_{mel}(m) = \ln\!\left(\sum_{k=0}^{L/2-1} H_m(k)\, P_t(k)\right),$

2.1.5. DCT

To separate the excitation source sound from the vocal tract response, a discrete cosine transform is introduced, and its calculation formula is as follows:
$\mathrm{MFCC}(n) = \sum_{m=1}^{M} W_{mel}(m) \cos\!\left(\dfrac{n\pi}{M}\left(m - \dfrac{1}{2}\right)\right).$
where M is the number of Mel filters.
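The five MFCC steps can be summarized in a short sketch that operates on the framed and windowed signal from the previous subsections. This is a simplified illustration: the triangular Mel filter bank is taken from librosa rather than built directly from the formulas above, the small epsilon added before the logarithm is a numerical-stability assumption, and the filter and coefficient counts are placeholders.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_from_frames(frames, sr, n_mels=26, n_mfcc=13):
    """MFCC pipeline: FFT -> power spectrum -> Mel filter bank -> log -> DCT."""
    L = frames.shape[1]
    spectrum = np.fft.rfft(frames, n=L, axis=1)                  # keep the positive half of the band
    power = (np.abs(spectrum) ** 2) / L                          # P_t[k] = |X_t[k]|^2 / L
    mel_fb = librosa.filters.mel(sr=sr, n_fft=L, n_mels=n_mels)  # triangular filters H_m(k)
    log_mel = np.log(power @ mel_fb.T + 1e-10)                   # W_mel(m), epsilon avoids log(0)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]

# Usage with the frames produced by the framing/windowing sketch above
# mfcc = mfcc_from_frames(frames, sr=8000)
```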

2.2. CWT

Wavelet transform is a time-frequency analysis method characterized by its ability to perform multi-scale analysis of signals using time-frequency windows of variable size. Its processing mainly includes two steps: selection of wavelet basis functions and calculation of wavelet coefficients.

2.2.1. Wavelet Basis Function Selection

This paper chooses the Morlet wavelet as the basis function for analysis, mainly because it is particularly suitable for time-frequency analysis. Its mathematical expression is:
$\psi(t) = \pi^{-\frac{1}{4}} e^{i\omega_0 t} e^{-\frac{t^2}{2}},$
Among them, ψ ( t ) represents the Morlet wavelet basis function, and ω 0 is the center frequency, which is usually greater than or equal to 5.

2.2.2. Wavelet Coefficient Calculation

Wavelet transform decomposes a signal into different scales and time positions by scaling and translating the basis functions. Its mathematical expression is as follows:
$WT(\alpha, \tau) = \dfrac{1}{\sqrt{\alpha}} \int f(t)\, \psi^{*}\!\left(\dfrac{t - \tau}{\alpha}\right) dt.$
where α represents the stretch factor, τ represents the translation factor, and f ( t ) represents the voiceprint signal. Due to the transient nature of relay operating sounds, low-frequency components do not affect feature extraction. Therefore, the lowest frequency of the basis function is set to 10 Hz. Furthermore, according to the Shannon sampling theorem, the maximum frequency of the basis function is half the acquisition rate, which is used as the basis for setting the stretch factor.
The CWT process generates a distribution map of signal energy on a two-dimensional time-scale plane based on frequency scale, time, and matching degree.
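The CWT step can be sketched with PyWavelets as below; the 'morl' wavelet name, the number of scales, and the example sampling rate are illustrative assumptions, and the frequency-to-scale conversion follows the library's convention (lowest analysis frequency 10 Hz and highest frequency half the sampling rate, as stated above).

```python
import numpy as np
import pywt

def morlet_scalogram(signal, fs, f_min=10.0, n_scales=128):
    """Continuous wavelet transform with a Morlet basis over frequencies [f_min, fs/2]."""
    freqs = np.linspace(f_min, fs / 2, n_scales)
    # Convert target frequencies (Hz) to scales for the 'morl' wavelet
    scales = pywt.scale2frequency('morl', 1) * fs / freqs
    coeffs, _ = pywt.cwt(signal, scales, 'morl', sampling_period=1.0 / fs)
    return np.abs(coeffs)  # energy distribution on the time-scale plane

# Hypothetical example: 1 s of audio sampled at 16 kHz
scalogram = morlet_scalogram(np.random.randn(16000), fs=16000)
print(scalogram.shape)  # (128, 16000)
```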

2.3. Gaussian-Laplacian Pyramid

An image fusion algorithm based on multiscale pyramid decomposition is proposed to effectively fuse the CWT and MFCC spectrograms derived from the same audio signal. The algorithm constructs a four-layer Gaussian pyramid and a Laplacian pyramid to extract the detailed texture of the CWT image and the contour structure of the MFCC image at different scales. It then employs a gradient-based fusion algorithm for feature selection and reconstruction, ultimately generating a fused image that combines the advantages of both.

2.3.1. Gaussian Pyramid

The process of constructing an image Gaussian pyramid is mainly divided into three core steps, and the specific process is shown in Figure 2.
First, based on the preset standard deviation parameters in the two-dimensional Gaussian function, a discretized two-dimensional Gaussian filter matrix, namely the Gaussian kernel, is calculated and generated. The mathematical formula of the probability distribution function of the two-dimensional Gaussian function is:
$G(x, y) = \dfrac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}},$
where σ is the standard deviation.
The second step is to convolve the Gaussian kernel with the original image to achieve low-pass filtering, effectively suppressing high-frequency details and noise in the image. This step preprocesses the image for the subsequent sampling operation and prevents distortion caused by spectral aliasing. Finally, the filtered image is down-sampled by removing alternate rows and columns of pixels, thereby reducing the image size to half of its original size. The specific expression is:
$G_i = \mathrm{DOWN}(G_{i-1} * g_{n \times n}),$
where $G_i$ is the image of the i-th layer of the Gaussian pyramid, $\mathrm{DOWN}(\cdot)$ represents the down-sampling process, $*$ represents the convolution operation, and $g_{n \times n}$ represents the n × n Gaussian kernel.
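A minimal OpenCV sketch of this construction is shown below; cv2.pyrDown combines the Gaussian filtering and the 2x down-sampling described above, the four-level depth and 224 x 224 input size follow the configuration stated in this paper, and the random test image is a placeholder.

```python
import cv2
import numpy as np

def gaussian_pyramid(image, levels=4):
    """Build a Gaussian pyramid: each level is Gaussian-filtered and downsampled by 2."""
    pyramid = [image.astype(np.float32)]
    for _ in range(levels - 1):
        pyramid.append(cv2.pyrDown(pyramid[-1]))  # G_i = DOWN(G_{i-1} * g)
    return pyramid

levels = gaussian_pyramid(np.random.rand(224, 224).astype(np.float32))
print([g.shape for g in levels])  # [(224, 224), (112, 112), (56, 56), (28, 28)]
```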

2.3.2. Laplace Pyramid

The Laplacian pyramid can be understood as the inverse process of the Gaussian pyramid. During its construction, the high-frequency detail information lost in the convolution and down-sampling operations of the Gaussian pyramid is recovered. Each layer of the Laplacian pyramid is the corresponding layer of the Gaussian pyramid minus the up-sampled (and Gaussian-smoothed) next layer of the Gaussian pyramid. The specific process is shown in Figure 1. Its expression is:
$L_i = G_i - \mathrm{UP}(G_{i+1}) * g_{n \times n},$
where $L_i$ is the image of the i-th layer of the Laplacian pyramid, $G_i$ is the image of the i-th layer of the Gaussian pyramid, and $\mathrm{UP}(\cdot)$ represents the up-sampling process.
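Building on the Gaussian pyramid, the following sketch derives the Laplacian pyramid by subtracting the up-sampled next Gaussian level from each level; cv2.pyrUp stands in for the UP(·) operation followed by the Gaussian-kernel convolution, and keeping the coarsest Gaussian image as the top level is a common convention assumed here.

```python
import cv2
import numpy as np

def laplacian_pyramid(image, levels=4):
    """Laplacian pyramid: L_i = G_i - UP(G_{i+1}) smoothed by the Gaussian kernel."""
    gauss = [image.astype(np.float32)]
    for _ in range(levels - 1):
        gauss.append(cv2.pyrDown(gauss[-1]))
    lap = []
    for i in range(levels - 1):
        up = cv2.pyrUp(gauss[i + 1], dstsize=(gauss[i].shape[1], gauss[i].shape[0]))
        lap.append(gauss[i] - up)
    lap.append(gauss[-1])  # coarsest level is kept as-is
    return lap
```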

2.3.3. Spectral Graph Fusion

After the MFCC image and the CWT image are processed by the Gaussian-Laplacian pyramid, three-level images of the Laplacian pyramid are obtained. Based on these six images, they are fused at different levels of the pyramid. The fusion adopts the gradient amplitude calculation method and uses two convolution kernels, horizontal convolution kernel G x and vertical convolution kernel G y , through the Sobel operator. The expression is as follows:
$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix},$
Perform convolution calculation on the image of a certain layer in the Laplace pyramid and calculate the gradient amplitude, that is:
$I_x = \mathrm{Image} * G_x,$
$I_y = \mathrm{Image} * G_y,$
$G = \sqrt{I_x^2 + I_y^2},$
where I x represents the horizontal gradient of the image, I y represents the vertical gradient of the image, and G represents the gradient amplitude.
Since the gradient amplitude in Formula (22) involves a square root, the computational cost increases. Therefore, Formula (22) is simplified to:
$G = \left| I_x \right| + \left| I_y \right|.$
After applying the Sobel operator to the two images, they may be reduced in size. However, using the Sobel operator in OpenCV automatically fills in the edges, resulting in a gradient magnitude matrix of the same size as the two images. The value at each position in the matrix represents the edge strength or feature saliency of the corresponding source image at that position and scale. This creates a fusion mask, whose formula is:
$\mathrm{Mask} = \begin{cases} 1, & k_M \ge k_C \\ 0, & k_M < k_C \end{cases}$
where M a s k represents the fusion mask matrix, k M is the value at a certain position in the MFCC gradient magnitude matrix, and k C is the value at the same position as k M in the CWT gradient magnitude matrix. When k M is greater than k C at a certain position, it means that the features in the MFCC image at that position are more prominent than those in the CWT image, and vice versa.
During the fusion process, the fusion strategy must first be determined based on the current pyramid level, which in turn determines the value of the mask. At the bottom of the pyramid, the image contains rich, fine-grained information. In this case, the fusion strategy tends to select areas with significant gradients and stronger details, aiming to maximize the clarity of high-frequency features such as texture and edges. At higher levels of the pyramid, the image exhibits coarse macroscopic structures, and the strategy adjusts to select areas with weaker gradients and smoother textures. Because high-level features reflect the overall contours of the image, this paper aims to derive the primary structure of the fusion result from the MFCC spectrum. Although some high-frequency components of the CWT spectrum may still have a high response at higher levels, this response may be due to irrelevant noise or overly trivial texture. By inverting the high-level mask, this interference can be effectively suppressed, ensuring that the smooth structure in the MFCC is highlighted and preserved. The formula for the inverted mask is:
$\mathrm{Mask}' = 1 - \mathrm{Mask},$
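The gradient-based mask and per-level fusion can be sketched as follows; the simplified magnitude |Ix| + |Iy|, the comparison between the MFCC and CWT gradient maps, and the mask inversion at higher levels mirror the rules above, while the weighted combination of the two Laplacian levels through the mask is an assumed implementation detail.

```python
import cv2
import numpy as np

def fuse_level(lap_mfcc, lap_cwt, invert_mask=False):
    """Fuse one Laplacian-pyramid level using the simplified Sobel gradient magnitude."""
    def grad_mag(img):
        ix = cv2.Sobel(img, cv2.CV_32F, 1, 0, ksize=3)  # horizontal gradient I_x
        iy = cv2.Sobel(img, cv2.CV_32F, 0, 1, ksize=3)  # vertical gradient I_y
        return np.abs(ix) + np.abs(iy)                  # G = |I_x| + |I_y|

    mask = (grad_mag(lap_mfcc) >= grad_mag(lap_cwt)).astype(np.float32)  # 1 where MFCC is more salient
    if invert_mask:                                      # higher levels: Mask' = 1 - Mask
        mask = 1.0 - mask
    return mask * lap_mfcc + (1.0 - mask) * lap_cwt
```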

2.3.4. Reconstruction

The Laplacian pyramid images of the MFCC and CWT spectrograms are fused layer by layer through gradient fusion to obtain the fused images $fusion_0$, $fusion_1$ and $fusion_2$. Reconstruction is then performed on this basis. The image reconstruction expression is:
$R_i = \begin{cases} fusion_i + \mathrm{UP}(R_{i-1}), & i \ge 1 \\ fusion_i, & i = 0 \end{cases}$
where $R_i$ is the reconstructed image at the i-th level. Ultimately, only the bottom-level reconstructed image is retained and used as the fusion result. The fusion process and the fusion result are shown in Figure 3 and Figure 4.
Since f u s i o n 0 , f u s i o n 1 , and f u s i o n 2 in the Laplacian pyramid layer are all grayscale images with a size of [1, 224, 224], in order to meet the input size of the model, it is necessary to perform color reconstruction on the single-channel grayscale image. The formula is as follows:
$\mathrm{Norm} = \dfrac{G}{255},$
$\mathrm{Img} = \mathrm{cmap}(\mathrm{Norm}).$
where $G$ is the grayscale image and $\mathrm{cmap}(\cdot)$ is the color map applied to each pixel.
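The reconstruction and color mapping can be sketched as below, assuming the fused levels are ordered finest-to-coarsest as in the Laplacian pyramid sketch above; the min-max normalization (used here instead of a fixed division by 255, since the fused Laplacian levels are floating-point and may contain negative values) and the 'viridis' colormap are assumptions for a self-contained example, as the paper does not state which color map is applied.

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

def reconstruct_and_colorize(fused_levels):
    """Reconstruct from the coarsest fused level downward, then map the result to RGB."""
    recon = fused_levels[-1]
    for lap in reversed(fused_levels[:-1]):
        # R_i = fusion_i + UP(R_{i-1}) at each finer level
        recon = lap + cv2.pyrUp(recon, dstsize=(lap.shape[1], lap.shape[0]))
    norm = (recon - recon.min()) / (recon.max() - recon.min() + 1e-10)
    rgb = plt.get_cmap('viridis')(norm)[..., :3]  # drop the alpha channel
    return (rgb * 255).astype(np.uint8)           # 3-channel image for the model input
```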

3. Model

The Swin Transformer, a significant evolution of ViT, effectively addresses the limitations of ViT, notably the quadratic computational complexity of global self-attention, by introducing a hierarchical structure and a shifted window mechanism, and achieves stronger modeling capability and higher computational efficiency across multiple visual tasks. Its core innovation lies in restricting self-attention computation to non-overlapping local windows, significantly reducing computational complexity. Furthermore, it enables cross-window connections through window shifting, achieving global modeling capability while maintaining linear computational complexity. The limitations of the standard Swin Transformer model in fault diagnosis are addressed by introducing a dedicated saliency branch. The original model typically assumes that all patches are equally important when processing images, which, to a certain extent, ignores the key features most relevant to the fault. To address this, we add a saliency module that simulates the attention mechanism of human vision, guiding the model to focus on salient areas in the image and thereby improving the ability to identify fault features. The improved model architecture is shown in Figure 5.

3.1. Saliency Enhancement Module

The basic version of the Swin Transformer consists of only the main branch. Its core function is to divide the input image into non-overlapping patches and implement a linear embedding mapping from pixel space to a high-dimensional feature space. This process is efficiently accomplished using a single convolution operation with 3 input channels, 96 output channels, a kernel size of Patch Size × Patch Size, and a stride of Patch Size. Therefore, a single convolution can simultaneously perform image segmentation and feature projection, directly mapping the input into a patch embedding representation of shape [Batch, 96, 56, 56]. Whether using CWT, MFCC, or a fusion of both, redundant features in the voiceprint signal are simultaneously extracted. For the transient characteristics of relay operating sounds, these redundant features are even more complex, making the single main branch insufficient for fault classification. Therefore, this paper introduces a saliency branch specifically to enhance the Swin Transformer’s ability to distinguish relay operating sounds.
The saliency module designed in this paper is a lightweight attention-guided network whose core function is to generate weight maps that indicate important regions in the input image. To address the limitations of the standard Swin Transformer model in fault diagnosis, a dedicated saliency module is introduced.
This module first uses 3 × 3 convolutional kernels with a stride of 1 and padding of 1 for feature extraction, expanding the 3 input channels to 16, as shown in the following formula:
$I_{padded} = \mathrm{pad}(I, (1, 1, 1, 1)),$
$F^{(1)} = W^{(1)} * I_{padded} + b^{(1)},$
where $I_{padded}$ represents the padded input, $I$ represents the model input, $F^{(1)}$ represents the output of the first convolutional layer, $W^{(1)}$ represents the weights of the 3 × 3 convolutional kernels with 3 input channels and 16 output channels, $b^{(1)}$ represents the bias of the first convolutional layer, and $*$ represents the convolution operation.
Subsequently, a nonlinear transformation is introduced using the ReLU activation function to enhance the model’s expressive power. The formula is as follows:
$F_{relu}^{(1)} = \max(0, F^{(1)}),$
Next, a 3 × 3 convolution kernel is used to compress the number of feature channels to 1, generating an initial saliency response map. Each value on this single-channel feature map represents a score of the “saliency” of the corresponding pixel location in the original image, and its formula is as follows:
$S_{before\_sigmoid} = W^{(2)} * F_{relu}^{(1)} + b^{(2)},$
where W ( 2 ) represents the weight of a 3 × 3 convolution kernel with 16 input channels and 1 output channel, and b ( 2 ) represents the bias of the second convolution process.
Finally, the response values are normalized to the [0, 1] interval using the Sigmoid activation function, forming the final significance weight map, as shown in the formula:
$S = \sigma(S_{before\_sigmoid}) = \dfrac{1}{1 + \exp(-S_{before\_sigmoid})},$
After weight fusion of the weighted image with the original input, the contribution of important regions is highlighted while irrelevant background information is suppressed. Finally, the weighted image is projected onto a 96-dimensional feature space through a 4 × 4 convolutional kernel with a stride of 4, generating a 56 × 56 multi-channel feature representation, as shown in the following formula:
$F_{main} = \mathrm{Proj}_{main}(I),$
$I_{weighted} = I \odot S,$
$F_{saliency} = \mathrm{Proj}_{saliency}(I_{weighted}),$
$F_{fused} = F_{main} + F_{saliency},$
where $\mathrm{Proj}_{main}$ represents the convolution projection of the main branch, $\mathrm{Proj}_{saliency}$ represents the convolution projection of the saliency branch, and $\odot$ represents the pointwise multiplication of the two matrices based on the broadcast mechanism.
This design enables the saliency branch to autonomously learn the attention distribution of key regions in the Mel spectrogram, guiding the subsequent feature extraction process to focus on more information-rich frequency bands and time domain segments, thereby improving the feature representation quality of the overall architecture.
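A PyTorch sketch of the saliency-guided patch embedding described above is given below; the layer sizes follow the formulas in this subsection (3→16→1 saliency convolutions, 4 × 4 stride-4 projections to 96 channels), while the module and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class SaliencyPatchEmbed(nn.Module):
    """Main patch projection plus a lightweight saliency branch whose [0, 1] weight map
    re-weights the input before a parallel projection; the two embeddings are summed."""
    def __init__(self, in_ch=3, embed_dim=96, patch_size=4):
        super().__init__()
        self.saliency = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, stride=1, padding=1),  # 3 -> 16 channels
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=3, stride=1, padding=1),      # 16 -> 1 response map
            nn.Sigmoid(),                                              # normalize to [0, 1]
        )
        self.proj_main = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.proj_saliency = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: [B, 3, 224, 224]
        s = self.saliency(x)                    # saliency weight map S: [B, 1, 224, 224]
        weighted = x * s                        # broadcast pointwise multiplication
        return self.proj_main(x) + self.proj_saliency(weighted)  # F_fused: [B, 96, 56, 56]

embed = SaliencyPatchEmbed()
print(embed(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 96, 56, 56])
```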

3.2. Swin-Transformer Block

The Swin Transformer block is the core component of the Swin Transformer architecture. By introducing SW-MSA, it achieves higher computational efficiency and stronger hierarchical modeling capabilities than the standard Transformer block. Each Swin Transformer block consists of W-MSA, SW-MSA, and MLP, supplemented by LayerNorm and residual connections. The key lies in the alternating use of regular window partitioning and shifted window partitioning. The former restricts self-attention calculations to non-overlapping local windows, significantly reducing computational complexity; the latter achieves cross-window information interaction by regularly offsetting windows by half the window size, achieving global modeling capability while maintaining linear computational complexity. The specific process of the Swin Transformer block is shown in Figure 6. The calculation process is as follows:
$\hat{O}_i = \mathrm{W\text{-}MSA}(\mathrm{LayerNorm}(O_{i-1})) + O_{i-1},$
$O_i = \mathrm{MLP}(\mathrm{LayerNorm}(\hat{O}_i)) + \hat{O}_i,$
$\hat{O}_{i+1} = \mathrm{SW\text{-}MSA}(\mathrm{LayerNorm}(O_i)) + O_i,$
$O_{i+1} = \mathrm{MLP}(\mathrm{LayerNorm}(\hat{O}_{i+1})) + \hat{O}_{i+1}.$
where O ^ i and O i represent the output results of the (S)W-MSA module and MLP module of the i-th Swin-Transformer block, respectively.
During the Swin model’s attention calculation process, each Swin-Transformer block performs self-attention within a local window. In the Tiny version of the model used in this article, the window size is set to 7 × 7, meaning each window contains 49 patches. The feature tensor shape of the input image after linear projection is [Batch, 96, 56, 56], corresponding to each feature map being divided into 8 × 8 = 64 non-overlapping windows in the spatial dimension. Attention calculation is performed independently within each such window. The window-based self-attention mechanism converts the input tensor into query, key, and value matrices through three independent linear layers. The specific mathematical formula for this calculation process is:
$Q = W_Q \times \mathrm{Patch\_Window},$
$K = W_K \times \mathrm{Patch\_Window},$
$V = W_V \times \mathrm{Patch\_Window},$
Among them, $Q$, $K$, $V$ represent the Query, Key and Value matrices within a given window, respectively. $W_Q$, $W_K$ and $W_V$ represent the learnable linear projection matrices of Query, Key and Value, respectively, with $W_Q, W_K, W_V \in \mathbb{R}^{49 \times 49}$, $\mathrm{Patch\_Window} \in \mathbb{R}^{49 \times 96}$ and $Q, K, V \in \mathbb{R}^{49 \times 96}$. The core step of the self-attention mechanism is to calculate the attention score between Query and Key. The specific calculation formula is:
$E = \dfrac{Q K^{T}}{\sqrt{d_k}},$
In the formula, E represents the Attention score matrix of the Query for the Key. Softmax regression is performed on the Attention score matrix to obtain the Attention Weights matrix. Finally, the Attention Weights matrix is multiplied by the Value to obtain the final output matrix, that is:
$\mathrm{weight} = \mathrm{softmax}(E),$
$O_W = \mathrm{weight} \times V,$
where $\mathrm{weight}$ represents the Attention Weights matrix, and $O_W$ represents the output matrix of the self-attention calculation within the window.
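The window partition and in-window attention can be sketched as follows. For simplicity this sketch uses standard per-channel linear projections for Q, K, and V, as in public Swin implementations (a simplification of the 49 × 49 matrix shapes written above), and it omits multi-head splitting and the relative position bias.

```python
import torch
import torch.nn as nn

def window_partition(x, win=7):
    """Split a [B, H, W, C] feature map into non-overlapping win x win windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)  # [num_windows*B, 49, C]

def window_self_attention(x, wq, wk, wv):
    """Per-window attention: E = Q K^T / sqrt(d_k), weight = softmax(E), O_W = weight V."""
    q, k, v = wq(x), wk(x), wv(x)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Hypothetical check with the Tiny configuration: 56 x 56 map, 96 channels, 7 x 7 windows
feat = torch.randn(2, 56, 56, 96)
wq, wk, wv = (nn.Linear(96, 96, bias=True) for _ in range(3))
out = window_self_attention(window_partition(feat), wq, wk, wv)
print(out.shape)  # torch.Size([128, 49, 96]) -> 64 windows per image
```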
To establish cross-window connections and achieve global modeling capability, the Swin model performs a window shift after completing the first-stage window self-attention calculation within each block. The shift distance is set to half the window size: for a 7 × 7 window, the shift amount is ⌊7 ÷ 2⌋ = 3. This shifts the windows by 3 pixels both horizontally and vertically, creating a new set of windows, and the second-stage attention calculation then continues within these newly formed windows. While this window shifting operation facilitates cross-window information exchange, it increases the number of windows beyond the original 64, significantly increasing the computational burden. To mitigate the resulting increase in computational complexity, the model avoids the traditional padding approach for handling the shifted, incomplete (non-7 × 7) windows and instead introduces a masking mechanism. This mechanism accurately and efficiently completes the self-attention calculation within the shifted windows while maintaining the original computational efficiency: all non-7 × 7 windows are cyclically shifted and spliced into complete 7 × 7 windows. The detailed processing for four windows is shown in Figure 7; the principle for 64 windows is the same. After the shift-and-splice process, the calculation continues according to the window-based self-attention mechanism. However, after the attention score matrix is calculated, a mask matrix is added to it to suppress the attention terms between patches that did not belong to the same original window, so that after the Softmax regression the corresponding attention weights are driven toward 0. The specific calculation formula is:
$E_{SW} = E + \mathrm{Mask},$
where E S W represents the attention score matrix based on the moving window, E represents the attention score matrix calculated for the moving window data using the window self-attention mechanism, and M a s k is the mask matrix. There are four specific forms of the mask, as shown in Figure 8.
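Under the assumption that the mask matrix has been precomputed for the window layout, the shifted-window step reduces to two operations, sketched below: a cyclic shift of the feature map by half the window size and the addition of a large negative mask to the attention scores before the softmax.

```python
import torch

def cyclic_shift(feat, shift=3):
    """SW-MSA step 1: cyclically shift the [B, H, W, C] feature map by half the window size."""
    return torch.roll(feat, shifts=(-shift, -shift), dims=(1, 2))

def masked_softmax(scores, attn_mask):
    """SW-MSA step 2: E_SW = E + Mask, then softmax.

    Positions where spliced patches come from different original windows carry a large
    negative mask value (e.g. -100), so their attention weights collapse to roughly 0.
    """
    return torch.softmax(scores + attn_mask, dim=-1)
```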

3.3. Classification Process

Since the Swin model doesn’t include a classification component, the model uses a combination of global average pooling and linear projection to achieve final classification. After processing by the I-Swin model, the input image is converted from a tensor of size [B, 3, 224, 224] to a feature tensor of size [B, 7, 7, 768]. The first step involves a global average pooling layer to aggregate the spatial dimensions, compressing the feature tensor of [B, 7, 7, 768] to [B, 1, 1, 768]. This is equivalent to fusing the 49-dimensional spatial features of each channel into a single global feature value. This tensor is then flattened into a vector of size [B, 768] using a Flatten layer. Finally, a linear classification layer maps this vector to the class space, outputting a predicted value of shape [B, num-classes], completing the classification process. The detailed process is as follows:
$O_{IST} = \mathrm{I\text{-}Swin}([B, 3, 224, 224]) = [B, 768, 7, 7],$
$O_{Avg} = \mathrm{AdaptiveAvgPool1d}(O_{IST}) = [B, 768, 1],$
$O_{Fla} = \mathrm{Flatten}(O_{Avg}) = [B, 768],$
$\mathrm{Output} = \mathrm{Linear}(O_{Fla}) = [B, \mathrm{num\_classes}].$
where $O_{IST}$ represents the size of the feature tensor after processing by the improved Swin-Transformer model; $O_{Avg}$ represents the size after global average pooling; $O_{Fla}$ represents the size after flattening; and $\mathrm{Output}$ represents the model output size.
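The classification head can be sketched as follows; the seven-class output (six fault types plus the normal class, per the dataset description in Section 4) is an assumption and remains a configurable parameter.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Global average pooling + flatten + linear layer on top of the I-Swin features."""
    def __init__(self, feat_dim=768, num_classes=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)   # aggregate the 49 spatial positions per channel
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, x):                      # x: [B, 768, 7, 7]
        x = x.flatten(2)                       # [B, 768, 49]
        x = self.pool(x).flatten(1)            # [B, 768]
        return self.fc(x)                      # [B, num_classes]

head = ClassificationHead()
print(head(torch.randn(2, 768, 7, 7)).shape)  # torch.Size([2, 7])
```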

4. Experiment

The experimental data for this study were collected from a railway signal relay maintenance station. In actual operating environments, the probability of relay failure is low, so data collection focused on the most common relay types with the highest failure rates. Faulty relays were repeatedly operated within the maintenance station, and the sound signals in their faulty states were recorded, ultimately generating independent audio files for each failure. This study focused on critical relays in railway signal systems that are involved in switch control and whose failures have significant impact. Three models, the JWXC-1700, JWJXC-H125/80, and JYJXC-160/260, were selected as research subjects. According to the fault frequency of these relays, six fault types were selected: JWXC-1700 model contact deformation, JWXC-1700 model resistance exceeding the standard, JWXC-1700 model coil short circuit, JWXC-1700 model contact surface oxidation, JWJXC-H125/80 model contact jamming, and JYJXC-160/260 model suction jamming.

4.1. Dataset

In order to allow the model to learn the fault pattern through voiceprint data, the number of samples for each fault type is about 1000. At the same time, the normal category includes the normal operating sounds of all three models, but the proportion of each type in the number of normal category samples is different. The specific number of data set samples is shown in Table 2:
The spectrum diagrams under MFCC, CWT and Fusion feature extraction methods are shown in Figure 9.

4.2. Comparative Experiment

To improve experimental efficiency, all training and inference were performed on the same deep learning server. The server configuration is as follows: the operating system is Ubuntu 22.04 (kernel version 5.15.0), equipped with a single NVIDIA L20 GPU (48 GB of video memory). The software environment is based on Python 3.9 and the PyTorch 2.5.0 deep learning framework. For the classification task in this paper, the Tiny version of the Swin Transformer was selected as the base model, and the saliency module proposed in this paper was introduced on top of it. Typical parameters are as follows: the base feature dimension is 96, the number of Transformer blocks in the four stages is 2, 2, 6, and 2, respectively, and the number of heads of the multi-head self-attention mechanism in each stage is 3, 6, 12, and 24, respectively. The model uses a 4 × 4 initial image patch partition and a 7 × 7 local attention window. The MLP expansion ratio is 4.0, and biases are enabled in the query, key, and value linear transformations.
To fully verify the effectiveness and superiority of the proposed method in the relay action sound fault classification task, we selected several of the most widely used models in the voiceprint fault diagnosis task and the latest models in the voiceprint classification field for comparative analysis, including ResNet-FD-CNN [4], GRU [22], MobileNet [23], ViT [24], and AST [25]. All comparative experiments were conducted on the voiceprint image dataset after processing with the Gaussian-Laplacian pyramid fusion rule. This dataset covers the most typical fault types, the most frequent fault modes, and voiceprint features under normal conditions in relay action sounds, thus enabling a comprehensive evaluation of the classification performance of different methods in complex fault scenarios. Through systematic comparative experiments, we verified the advantages of the proposed method in terms of classification accuracy and robustness.
Among the comparison methods, ResNet-FD-CNN, MobileNet, ViT, AST, and the proposed I-Swin method all use the fused three-channel RGB images as input. Considering the particular characteristics of the GRU model when processing image data, its input is kept as a grayscale fused image without color reconstruction, and the model is configured with 4 hidden layers, each containing 64 hidden units. For the ViT model, because training from scratch requires a very large number of parameters and the experimental data in this paper are limited, direct training yields poor results. Therefore, we adopt a transfer learning strategy, loading ViT with weights pre-trained on the ImageNet-1K dataset and fine-tuning it with a learning rate of 1e−5. The remaining hyperparameters are kept consistent with the proposed I-Swin method to ensure the fairness of the comparison. The test results of the trained weights on the test set are shown in Table 3.
Table 3 shows the test set results of the compared methods after training with fused images. Experimental results show that the I-Swin, ResNet-FD-CNN, MobileNet, AST, and ViT models using the color-reconstructed dataset all perform well in the relay action sound fault classification task, while the GRU model using fused grayscale images performs poorly. Specifically, GRU is extremely sensitive to irrelevant features in the image, and grayscale fused images lack an effective feature suppression mechanism, making it difficult for the model to capture the key discriminative information in the classification task. Although the models other than GRU perform well overall, the strict requirement for preventing false alarms in relay fault repair means that fault detection accuracy needs to exceed 95%. While ResNet-FD-CNN and MobileNet have some classification ability, they fail to meet this key indicator. In contrast, the proposed I-Swin method, together with ViT and AST, maintains high overall performance with accuracy significantly better than the remaining comparison models, better meeting the requirement for preventing false alarms in practical engineering.
The table also details the parameter size comparison of the models. GRU has the fewest parameters, while ViT and AST have a large number of parameters. In contrast, the proposed I-Swin model has less than half the number of parameters of ViT and AST, yet its classification accuracy on the test set is significantly higher than GRU and differs only slightly from ViT and AST. Specifically, the I-Swin model is based on the Tiny version of the Swin model, with a basic parameter size of approximately 28 M. After introducing the saliency module, the parameter size increases to 37.4 M, while the model’s classification performance is further improved. The key to the I-Swin model’s ability to maintain high classification accuracy while significantly reducing the number of parameters lies in its shifted-window attention mechanism. In the ViT and AST models, each patch performs global attention interactions with all other image patches; for example, a 224 × 224 input divided into 14 × 14 = 196 patches of size 16 × 16 requires every patch to attend to all the others, leading to a sharp increase in computational complexity. The I-Swin model, by dividing the feature map into multiple local windows, restricts attention calculation to within each window, significantly reducing the computational burden. Meanwhile, the patch size is further reduced from 16 × 16 to 4 × 4, and the window-shifting strategy maintains the model’s global modeling capability by enabling information interaction between different local regions. It is worth noting that the reduced number of parameters allows I-Swin to be sufficiently trained on the dataset presented in this paper.
Under the same hardware and parameter configuration, a time efficiency experiment was conducted on the I-Swin model, and the results are shown in Table 4. Both the I-Swin and Swin models significantly reduced FLOPs, single-round training time, and inference time compared to ViT and AST. The results demonstrate that the I-Swin model possesses excellent time efficiency in the voiceprint fault detection task and can effectively meet the actual requirements of real-time performance and computational resources for railway signal relay fault detection.
To evaluate the model’s stability, we trained the model independently ten times under fixed training parameters, hardware environment, and dataset conditions, and recorded its performance metrics. The stability experiment results are shown in Table 5. The standard deviations of the accuracy, recall, and F1 score obtained from the ten training runs were all between 0.1% and 0.2%, indicating that the model has strong training stability and reproducibility.
In summary, in the relay fault voiceprint classification task, the method proposed in this paper not only demonstrates excellent classification performance, but also achieves a better balance between computational efficiency and performance due to its significantly reduced number of parameters.
Figure 10 and Figure 11 show the loss and accuracy curves of the I-Swin model on the Fusion image training and validation sets, respectively. The results demonstrate that the proposed method achieves rapid convergence on both the training and validation sets, with satisfactory accuracy.

4.3. Ablation Experiments

To systematically validate the effectiveness of the proposed I-Swin model structure and MFCC-CWT fusion rule, we conducted detailed ablation experiments. These experiments focused on two core design concepts: first, evaluating the contribution of the saliency module to model performance by comparing the performance of I-Swin with the original Swin model Tiny version on the same dataset; and second, verifying the advantages of the proposed fusion rule by comparing the model’s classification performance under three different data inputs: MFCC, CWT, and MFCC-CWT fusion. All comparative experiments were conducted with a unified training setup and corresponding datasets to ensure comparability of results and the reliability of conclusions.
Table 6 shows that when the Swin model is used, classification performance is good regardless of whether the MFCC, CWT, or Fusion dataset is used, but none of the variants meets the requirement for false alarm prevention. This is primarily because certain relay voiceprints exhibit extremely similar characteristics, such as the voiceprints of JWXC-1700 contacts with surface oxidation and those of normal operation. Surface oxidation forms an oxide layer that changes the voiceprint only slightly, making such categories prone to classification errors. The feature extraction and fusion processes of MFCC, CWT, and Fusion extract not only features that have a positive impact on classification, but also features that have no or even a negative impact. The Swin-Transformer model, without a saliency module, treats all of these features as key diagnostic features during training, ultimately resulting in poorer classification performance compared to the I-Swin model.
Table 6 also shows that the classification effect of the I-Swin model on the CWT dataset alone is lower than that of the Swin model. Experimental analysis shows that this is closely related to the CWT feature extraction image. Compared to the MFCC features that are concentrated in a small area, the CWT features appear as continuous, clustered energy clusters. In this case, a saliency module designed to find local prominent areas conflicts with the characteristics of the data. When a large area in the atlas contains discriminative information, using the module to focus on a “smaller” salient area will actually destroy the original integrity and continuity of the CWT features, filter out useful contextual information, and lead to performance degradation.
Ultimately, a comprehensive comparison shows that the I-Swin model combined with the Fusion dataset achieves the strongest classification performance, meeting the requirement for preventing false alarms. Furthermore, as shown in Section 4.2, it also uses fewer parameters, achieving a better balance between computational efficiency and performance.

4.4. Confusion Matrix

To more intuitively demonstrate the effectiveness of the proposed method, the test set results of I-Swin weights trained on the Fusion dataset are visualized using a confusion matrix. Figure 12 shows the confusion matrix.
The vertical axis of the confusion matrix represents the true labels of the samples, and the horizontal axis represents the labels predicted by the model. The confusion matrix shows that the main classification error occurs when samples with the true label “JWXC-1700 model contact oxidation” are misclassified as “JWXC-1700 model contact deformation”. A total of 23 samples were misclassified in this way, and another two were misclassified as “normal”. Otherwise, the model correctly classified all remaining samples, demonstrating strong classification capability.

5. Conclusions

To address the challenge of railway signal relay fault diagnosis, in which the transient operating sounds of different fault categories produce highly similar features, this paper proposes a novel method that fuses MFCC and CWT features with an improved Swin-Transformer. This method uses MFCC features as the primary framework for the fused image and introduces CWT features as an effective supplement, jointly constructing a “fusion image” that integrates multi-source information. Furthermore, this paper designs a saliency enhancement module that significantly improves the discriminability of fault features in the fusion image by strengthening key features and suppressing redundant ones, laying an important foundation for subsequent improvements in classification accuracy.
Experimental results demonstrate that the proposed method achieves an optimal balance between performance and computational complexity in fault classification tasks. The pre-trained and fine-tuned ViT and AST models slightly outperform the proposed I-Swin method on the test set, but they have more parameters, higher computational complexity, and significantly slower inference and per-epoch training speeds than the I-Swin method. In terms of classification accuracy, the I-Swin method is only about 1% lower than ViT and AST, exhibiting strong competitiveness. Regarding feature extraction capability, ablation experiments validate the effectiveness of the proposed Gaussian-Laplacian pyramid fusion rule based on MFCC and CWT in the relay voiceprint fault diagnosis task. This fusion process provides the model with information-rich input samples that fully encompass the discriminative patterns in the voiceprint signal, laying an important foundation for subsequent accurate classification.
However, this study still has certain limitations. First, because the faults collected in this study were from a batch of relays currently undergoing repair at a railway signal relay maintenance station, the fault types do not cover all relay models. Second, to achieve the ultimate goal of “fault repair,” relying solely on classification accuracy is insufficient; the model’s inference efficiency and real-time performance are also crucial in practical deployments. Finally, actual relay workplaces are subject to some noise, such as human activity and ventilation fan noise. Considering only the noise-free case does not guarantee the generalizability of the method across diverse scenarios. Therefore, future research requires collecting more diverse faults from different relay types, including at least the most frequent faults across all critical relays (such as turnout relays and signal light relays). Furthermore, in subsequent studies, noise should be introduced to measure classification performance under varying signal-to-noise ratios. While maintaining the current classification accuracy, further efforts should be made to reduce the model’s structure and optimize its inference speed to better meet the application requirements of real-time railway field diagnosis.

Author Contributions

Conceptualization, Y.L., L.C. and S.Z.; software, Y.L.; investigation, Y.L., L.C., Z.W., S.Z. and B.Z.; resources, Z.W., L.C. and B.Z.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, Z.W., L.C., S.Z. and B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Engineering Research Center for High-Speed Railway and Urban Rail Transit Systems Technology of the China Academy of Railway Sciences under the project “Research on Fault Identification through Acoustic Feature Extraction and Multi-Feature Fusion in Railway Signal Relays” (2025YJ238), by the Zhongxing Wang (Dalian) Technology Co., Ltd. research project (AI24L00070), and by the Major Innovation Projects of Tianjin’s Leading Science and Technology Enterprises (24YDLQYS00110).

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to 24120224@bjtu.edu.cn.

Conflicts of Interest

Author Liang Chen was employed by the company Railway Science & Technology Research & Development Center China Academy of Railway Sciences Corporation Limited. Author Zhen Wang was employed by the company Hefei Signaling and Telecommunications Section of China Railway Shanghai Group Co., Ltd. The authors declare that this study received funding from Zhongxing Wang (Dalian) Technology Co., Ltd. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GAN: Generative Adversarial Network
VMD: Variational Mode Decomposition
IMF: Intrinsic Mode Function
PCA: Principal Component Analysis
LSTM: Long Short-Term Memory
MFCC: Mel Frequency Cepstral Coefficient
CWT: Continuous Wavelet Transform
Swin: Hierarchical Vision Transformer using Shifted Windows
I-Swin: Improved Hierarchical Vision Transformer using Shifted Windows
GRU: Gated Recurrent Unit
ViT: Vision Transformer
ResNet: Residual Network
W-MSA: Window Multi-head Self-Attention
SW-MSA: Shifted Window Multi-head Self-Attention
DCT: Discrete Cosine Transform
FFT: Fast Fourier Transform
MLP: Multilayer Perceptron
AST: Audio Spectrogram Transformer

References

1. Song, H.; Li, L.; Li, Y.; Tan, L.; Dong, H. Functional Safety and Performance Analysis of Autonomous Route Management for Autonomous Train Control System. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13291–13304.
2. Song, H.; Gao, S.; Li, Y.; Liu, L.; Dong, H. Train-Centric Communication Based Autonomous Train Control System. IEEE Trans. Intell. Veh. 2023, 8, 721–731.
3. Liu, Z.; Chen, Y.; Zhang, D.; Guo, F. A Dynamic Voiceprint Fusion Mechanism with Multispectrum for Noncontact Bearing Fault Diagnosis. IEEE Sens. J. 2025, 25, 8710–8720.
4. Gunson, N.; Marshall, D.; McInnes, F.; Jack, M. Usability evaluation of voiceprint authentication in automated telephone banking: Sentences versus digits. Interact. Comput. 2011, 23, 57–69.
5. Huang, Z.; Zhang, X.; Wang, L.; Li, Z. Study and implementation of voiceprint identity authentication for Android mobile terminal. In Proceedings of the 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Shanghai, China, 14–16 October 2017; pp. 1–5.
6. Zhao, H.; Yue, L.; Wang, W.; Xiangyan, Z. Research on End-to-end Voiceprint Recognition Model Based on Convolutional Neural Network. J. Web Eng. 2021, 20, 1573–1586.
7. Zhang, K.; Lu, H.; Han, S.; Zhao, X. A Novel Causal Federated Transfer Learning Method for Power Transformer Fault Diagnosis Based on Voiceprint Recognition. IEEE Sens. J. 2025, 25, 35573–35584.
8. Liu, M.; Li, Z.; Sheng, G.; Jiang, X. A Deep Learning Algorithm for Power Transformer Voiceprint Recognition in Strong-Noise and Small-Sample Scenarios. IEEE Trans. Instrum. Meas. 2025, 74, 1–11.
9. Yu, Z.; Wei, Y.; Niu, B.; Zhang, X. Automatic Condition Monitoring and Fault Diagnosis System for Power Transformers Based on Voiceprint Recognition. IEEE Trans. Instrum. Meas. 2024, 73, 1–11.
10. Zheng, W.; Zhang, F.; Sun, X.; Tan, H.; Li, C. Recognition of short-circuit discharge sounds in transmission lines based on hybrid voiceprint features. In Proceedings of the 2024 6th International Conference on Internet of Things, Automation and Artificial Intelligence (IoTAAI), Guangzhou, China, 26–28 July 2024; pp. 200–204.
11. Mao, P.; Zhang, Y.; Li, W.; Shi, Z. Research on Voiceprint Recognition Method of Hydropower Unit Online-State Based on Mel-Spectrum and CNN. In Proceedings of the 2024 IEEE 8th Conference on Energy Internet and Energy System Integration (EI2), Shenyang, China, 29 November–2 December 2024; pp. 7–11.
12. Mushi, R.; Huang, Y.-P. Assessment of Mel-Filter Bank Features on Sound Classifications Using Deep Convolutional Neural Network. In Proceedings of the 2021 International Conference on System Science and Engineering (ICSSE), Ho Chi Minh City, Vietnam, 26–28 August 2021; pp. 334–339.
13. Shia, S.E.; Jayasree, T. Detection of pathological voices using discrete wavelet transform and artificial neural networks. In Proceedings of the 2017 IEEE International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), Srivilliputtur, India, 23–25 March 2017; pp. 1–6.
14. Xiong, L.; Liao, Z.; Liang, Y.; Gong, X. Research on Voiceprint Recognition Based on MFCC-PCA-LSTM. In Proceedings of the 2023 2nd International Conference on Big Data, Information and Computer Network (BDICN), Xishuangbanna, China, 6–8 January 2023; pp. 121–125.
15. Parisotto, S.; Calatroni, L.; Bugeau, A.; Papadakis, N.; Schönlieb, C.-B. Variational Osmosis for Non-Linear Image Fusion. IEEE Trans. Image Process. 2020, 29, 5507–5516.
16. Prakash, O.; Srivastava, R.; Khare, A. Biorthogonal wavelet transform based image fusion using absolute maximum fusion rule. In Proceedings of the 2013 IEEE Conference on Information & Communication Technologies, Thuckalay, India, 11–12 April 2013; pp. 577–582.
17. Li, M.; Jian, Z.; Jie, L. Research on Image Fusion Rules Based on Contrast Pyramid. In Proceedings of the 2018 Chinese Automation Congress (CAC), Shenyang, China, 20–22 December 2018; pp. 1361–1364.
18. Gulsoy, E.K.; Gulsoy, T.; Ayas, S.; Kablan, E.B. Remote Sensing Scene Image Classification with Swin Transformer-Based Transfer Learning. In Proceedings of the 2025 9th International Symposium on Innovative Approaches in Smart Technologies (ISAS), Gaziantep, Turkiye, 27–28 June 2025; pp. 1–6.
19. Zhang, X.; Wen, C. Fault diagnosis of rolling bearings based on dual-channel Transformer and Swin Transformer V2. In Proceedings of the 2024 43rd Chinese Control Conference (CCC), Kunming, China, 28–31 July 2024; pp. 4828–4834.
20. Dou, H.; Ma, L.; Gao, C.; Zhang, Y. A Dual-Channel Decision Fusion Framework Integrating Swin Transformer and ResNet for Multi-Speed Gearbox Fault Diagnosis. In Proceedings of the 2025 4th Conference on Fully Actuated System Theory and Applications (FASTA), Nanjing, China, 4–6 July 2025; pp. 1469–1474.
21. Fan, F.; Wang, B.; Zhu, G.; Wu, J. Efficient Faster R-CNN: Used in PCB Solder Joint Defects and Components Detection. In Proceedings of the 2021 IEEE 4th International Conference on Computer and Communication Engineering Technology (CCET), Beijing, China, 13–15 August 2021; pp. 1–5.
22. Abulizi, J.; Chen, Z.; Liu, P.; Sun, H.; Ji, C.; Li, Z. Research on Voiceprint Recognition of Power Transformer Anomalies Using Gated Recurrent Unit. In Proceedings of the 2021 Power System and Green Energy Conference (PSGEC), Shanghai, China, 20–22 August 2021; pp. 743–747.
23. Hongming, L.; Zhang, K.; Han, S. A Comparison of CNN-Based Transformer Fault Diagnosis Methods Based on Voiceprint Signal. In Proceedings of the 2024 IEEE 13th Data Driven Control and Learning Systems Conference (DDCLS), Kaifeng, China, 17–19 May 2024; pp. 819–824.
24. Zhou, S.; Chen, L.; Feng, B.; Xiao, J.; Zheng, W. Voiceprint Diagnosis Method for Urban Rail Power Transformers Based on Mel Spectrogram and Improved Vision Transformer. In Proceedings of the 2025 7th International Conference on Software Engineering and Computer Science (CSECS), Taicang, China, 21–23 March 2025; pp. 1–8.
25. Cappellazzo, U.; Falavigna, D.; Brutti, A.; Ravanelli, M. Parameter-Efficient Transfer Learning of Audio Spectrogram Transformers. In Proceedings of the 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), London, UK, 22–25 September 2024; pp. 1–6.
Figure 1. Four main faults of railway signal relays.
Figure 2. Three-layer Gaussian-Laplacian pyramid diagram for processing the MFCC.
Figure 3. Fusion2 is up-sampled and added to Fusion1 to obtain the result, and the result is up-sampled and added to Fusion0 to obtain the fused grayscale image, which is then color mapped to obtain the fused RGB image.
Figure 4. From (left) to (right): CWT image, MFCC image, and fused image.
Figure 5. Improved Swin Transformer model.
Figure 6. Swin Transformer block specific architecture.
Figure 7. The moving stitching process of 4 windows.
Figure 8. The Mask matrix of each window after moving in Figure 7, where purple indicates a value of −100 and yellow indicates a value of 0.
Figure 9. Specific data of the three types of data sets: MFCC, CWT, and Fusion.
Figure 10. Line chart of training loss and validation loss.
Figure 11. Line chart of training accuracy and validation accuracy.
Figure 12. Confusion matrix of the I-Swin model classification on the Fusion test set.
Table 1. Feature Extraction Literature Analysis.
Author | Feature Extraction Method | Advantages | Limitations
P. Mao et al. | Mel spectrum [11] | The Mel spectrogram offers higher resolution at low frequencies and lower resolution at high frequencies. | Phase information is completely discarded.
R. Mushi et al. | Mel filter-bank features [12] | Effective dimensionality reduction is achieved, reducing the computational load. | High time resolution and high frequency resolution cannot be achieved simultaneously.
S. E. Shia et al. | Wavelet transform [13] | Capable of providing variable time-frequency resolution. | The choice of wavelet basis depends on prior knowledge and experience.
L. Xiong et al. | MFCC and ΔMFCC [14] | Features are highly compact and decorrelated. | Sensitive to noise.
Table 2. Relay Voiceprint Data Training Set, Validation Set, and Test Set Configuration.
Class | Train | Val | Test
JWXC-1700 Contact deformation | 849 | 105 | 105
JWXC-1700 Resistance exceeds standard | 870 | 108 | 108
JWXC-1700 Coil break | 871 | 108 | 108
JWXC-1700 Contact surface oxidation | 816 | 102 | 102
JWJXC-H125/80 Foreign matter on contact | 633 | 79 | 79
JYJXC-160/260 Suction stuck | 728 | 90 | 90
Normal (all relays), JWXC-1700 | 1523 | 152 | 152
Normal (all relays), JWJXC-H125/80 | 283 | 28 | 28
Normal (all relays), JYJXC-160/260 | 326 | 33 | 33
Normal class sum | 2132 | 213 | 213
Table 3. Comparison of test results of different methods after training on the Fusion dataset.
Method | Accuracy | Recall | F1 | Number of Parameters
I-Swin | 0.963 | 0.954 | 0.957 | 37.4 M
Res-FD-CNN | 0.923 | 0.923 | 0.901 | 21.8 M
GRU | 0.827 | 0.827 | 0.825 | 1.2 M
MobileNet | 0.921 | 0.921 | 0.918 | 3.5 M
ViT | 0.972 | 0.972 | 0.969 | 86 M
AST | 0.969 | 0.969 | 0.969 | 86 M
Table 4. Comparison of Timeliness.
Model | FLOPs | Time Required to Train One Epoch | Inference Time
I-Swin | ~4.9 G | 14 s | 1.25 ms
Swin | ~4.7 G | 10 s | 1.02 ms
ViT | ~17.5 G | 34 s | 2.88 ms
AST | ~17.5 G | 37 s | 2.94 ms
Table 5. Statistical analysis of I-Swin.
Run | Accuracy | Recall | F1
1 | 0.963 | 0.954 | 0.957
2 | 0.963 | 0.954 | 0.957
3 | 0.963 | 0.955 | 0.955
4 | 0.962 | 0.954 | 0.957
5 | 0.962 | 0.954 | 0.955
6 | 0.963 | 0.954 | 0.957
7 | 0.962 | 0.954 | 0.956
8 | 0.959 | 0.951 | 0.951
9 | 0.961 | 0.955 | 0.957
10 | 0.960 | 0.953 | 0.953
Mean | 0.9618 | 0.9538 | 0.9555
Std | 0.14% | 0.11% | 0.19%
Table 6. Comparison of ablation experiment test results.
Model | Method | Accuracy | Recall | F1
Swin | MFCC | 0.915 | 0.911 | 0.908
Swin | CWT | 0.914 | 0.905 | 0.907
Swin | Fusion | 0.939 | 0.937 | 0.935
I-Swin | MFCC | 0.949 | 0.949 | 0.945
I-Swin | CWT | 0.875 | 0.866 | 0.868
I-Swin | Fusion | 0.963 | 0.954 | 0.957