Failure Identification Method of Sound Signal of Belt Conveyor Rollers under Strong Noise Environment

Abstract: Accurately extracting faulty sound signals from belt conveyor rollers within the high-noise environment of coal mine operations presents a formidable challenge. To address this issue, this study introduces a fault diagnosis method that merges the variational mode decomposition (VMD) model with the Swin Transformer deep learning network. First, the adaptive VMD method was employed to eliminate intense noise from the original roller signal, while the reconstruction accuracy of the VMD signal across different modal components was assessed. Subsequently, the impact of the parameter structure of the Swin Transformer network on fault diagnosis accuracy was examined. Finally, the accuracy of the method was validated using a roller sound test dataset. The results indicated that optimizing the K value of the VMD method effectively reduced the noise in the reconstructed signal, and that the Swin Transformer excelled in extracting both local and global features. Specifically, on the conveyor roller sound dataset, after reconstructing the signal from the modal components with the highest Pearson correlation coefficients (K = 3) and adjusting the parameters of the Swin Transformer coding layers, the VMD+Swin-S model achieved an accuracy of 99.36%, while the VMD+Swin-T model achieved 98.6%. The accuracy of the VMD+Swin-S model was also higher than that of the VMD+CNN model (95.4%) and the VMD+ViT model (97.68%). In the example application experiments, the VMD+Swin-S model achieved the highest accuracy among all compared models at all three speeds, with 98.67%, 98.32%, and 97.65%, respectively. Overall, the approach demonstrated high accuracy and robustness, rendering it an optimal choice for diagnosing conveyor belt roller faults in environments characterized by strong noise.


Introduction
The safe operation of belt conveyors is essential to mining and heavy industries, where a significant share of failures originates in the belt's supporting rollers. Owing to their harsh working conditions, rollers are prone to various types of failures that directly impact the safety of coal transportation operations, overall operational efficiency, and the lifespan of the belt conveyor.
Although applying audio signals to roller fault diagnosis is a promising direction [1], the substantial noise introduced by the conveyor belt's complex working environment can prevent faults from being identified accurately. Thus, extracting the critical fault characteristics from such audio signals is of immense significance for ensuring the smooth and efficient operation of belt conveyors. The sound signal is often affected by high-intensity noise, and directly using a model to identify a signal interfered with by noise leads to unsatisfactory fault identification.
In summary, an audio signal processing method is proposed in this study. The method utilizes VMD to reconstruct the input signal to reduce the interference of noise, and then introduces the Swin Transformer model; by adjusting the parameters of the coding layer, a Swin-S model adapted to the VMD is obtained to classify and identify the signal. This improves the accuracy of belt conveyor roller fault diagnosis in noisy environments and provides a research idea for identifying roller faults under high-intensity noise. The remainder of this article is structured as follows. In Section 2, the overall fault diagnosis process, the VMD algorithm, and the principle of the Swin Transformer network model are described. In Section 3, the effectiveness of the VMD+Swin-S model in belt conveyor roller fault diagnosis is verified by comparing the recognition performance of multiple models. The effectiveness of the VMD+Swin-S model is further verified through industrial field experiments in Section 4. The conclusions are presented in Section 5.

Fault Diagnosis Methodology Flow of VMD Combined with Swin Transformer
In this study, a combination of VMD and the Swin Transformer was used for fault diagnosis of belt conveyor rollers. First, the VMD algorithm was used to realize signal decomposition and reconstruction. Subsequently, the wavelet transform was applied to generate time-frequency domain feature maps from the denoised and reconstructed signals.
Finally, these signal spectral images were fed into a Swin Transformer network for feature extraction, and the extracted features were passed to a classifier to perform fault diagnosis. For a visual representation of the workflow of the proposed method, please refer to Figure 1. The steps were as follows:
(1) The VMD algorithm was used to decompose the signal into K modal components, the Pearson correlation coefficient was used to select the modal components with the most significant correlation, and the denoised signal was reconstructed.
(2) The wavelet transform was applied to the reconstructed, denoised signal to obtain its time-frequency domain feature image.
(3) The images were divided proportionally into training, validation, and test sets.
(4) The training and validation sets were fed into the Swin Transformer model for training, and the optimal model was selected.
(5) Training was terminated, and the test set was fed into the optimal model for testing.
(6) The test results were output.
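The six steps above can be sketched as a minimal pipeline. Here `vmd_reconstruct` is a crude low-pass stand-in for the VMD denoising of step (1), and `time_frequency_image` is a short-time Fourier magnitude map standing in for the wavelet transform of step (2); the function names and parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vmd_reconstruct(x, keep_frac=0.2):
    """Stand-in for VMD denoising: keep only the lowest-frequency part
    of the spectrum (the real method reconstructs from selected IMFs)."""
    X = np.fft.rfft(x)
    cutoff = int(len(X) * keep_frac)
    X[cutoff:] = 0.0
    return np.fft.irfft(X, n=len(x))

def time_frequency_image(x, win=64, hop=32):
    """Stand-in for the wavelet scalogram: magnitude of a sliding-window FFT."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1)).T  # (freq, time)

def split_indices(n, train=0.7, val=0.15, seed=0):
    """Step (3): proportional train/validation/test split."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = int(n * train), int(n * val)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

# Steps (1)-(3) on a toy batch of noisy signals
rng = np.random.default_rng(1)
signals = [np.sin(2 * np.pi * 0.02 * np.arange(1024))
           + 0.5 * rng.standard_normal(1024) for _ in range(20)]
images = [time_frequency_image(vmd_reconstruct(s)) for s in signals]
train_idx, val_idx, test_idx = split_indices(len(images))
```

Steps (4)-(6) would feed `images[train_idx]` and `images[val_idx]` into the Swin Transformer trainer and evaluate the selected checkpoint on `images[test_idx]`.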

VMD Signal Preprocessing Algorithm
The VMD algorithm operates by identifying a collection of modes along with their corresponding center frequencies, allowing these modes to recreate the input signal accurately. At its core, the algorithm expands the classical Wiener filter by incorporating multiple adaptive frequency bands. To efficiently optimize the variational model, the alternating direction multiplier method was employed to enhance the resilience of the model against sampling noise.
Assume that the real-valued signal f can be decomposed into K sparse intrinsic modal components u_k. Each component u_k has its spectrum primarily concentrated within a specific bandwidth centered around ω_k. The objective of VMD decomposition is to minimize the bandwidth of each component while maintaining sparsity, essentially aiming for each modal component to be as tightly focused as possible around its center frequency ω_k. To achieve this goal, the following constrained variational problem is modeled [26]:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \tag{1}$$

$$\text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t) \tag{2}$$

In Equations (1) and (2), {u_k} = {u_1, u_2, ..., u_K} are the K modal components, {ω_k} = {ω_1, ω_2, ..., ω_K} are the center frequencies of each component, * denotes the convolution operation, and ∂_t denotes the gradient operation with respect to t.
Introducing the quadratic penalty term α and the Lagrange multiplier λ(t), the constrained optimization problem of Equations (1) and (2) is converted into an unconstrained problem with the following augmented Lagrangian [26]:

$$L(\{u_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle \tag{3}$$

In Equation (3), α is the penalty factor, λ(t) is the Lagrange multiplier, and ⟨·,·⟩ denotes the inner product operation.
Once the unconstrained optimization objective function is established, the alternating direction method of multipliers (ADMM) is applied iteratively to seek optimality and determine the modal components. Initially, a set of u_k is assumed as a known condition, and the center frequencies ω_k that minimize Equation (3) are computed. Subsequently, the new center frequencies ω_k are treated as known conditions, and the u_k that minimize Equation (3) are computed. This process of alternating updates between u_k and ω_k continues until the algorithm converges or is terminated, culminating in the determination of the optimal values of u_k and ω_k and completing the VMD decomposition.
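The alternating updates can be sketched with a compact frequency-domain implementation. This is a minimal, simplified VMD (no mirror extension, fixed iteration count), not the authors' exact code: each mode is updated by the Wiener-filter rule u_k = (f − Σ_{i≠k} u_i + λ/2) / (1 + 2α(ω − ω_k)²), and each center frequency ω_k is updated as the spectral centroid of that mode.

```python
import numpy as np

def vmd(f, K, alpha=2000.0, tau=0.0, n_iter=200):
    """Minimal VMD: ADMM-style alternating updates in the frequency domain.
    Frequencies are normalized (sampling period = 1)."""
    N = len(f)
    freqs = np.fft.fftfreq(N)                      # normalized frequency axis
    pos = freqs >= 0
    f_hat = np.fft.fft(f)
    f_hat_plus = np.where(pos, f_hat, 0.0)         # one-sided (analytic) spectrum
    u_hat = np.zeros((K, N), dtype=complex)
    omega = np.linspace(0, 0.5, K + 2)[1:-1]       # spread initial center frequencies
    lam = np.zeros(N, dtype=complex)               # Lagrange multiplier (tau=0 disables it)
    for _ in range(n_iter):
        for k in range(K):
            # Wiener-filter update of mode k against the residual spectrum
            residual = f_hat_plus - u_hat.sum(axis=0) + u_hat[k]
            u_hat[k] = (residual + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            u_hat[k][~pos] = 0.0
            # center frequency = spectral centroid of the mode's power spectrum
            power = np.abs(u_hat[k][pos]) ** 2
            omega[k] = np.sum(freqs[pos] * power) / (np.sum(power) + 1e-12)
        lam = lam + tau * (u_hat.sum(axis=0) - f_hat_plus)
    modes = np.array([2 * np.real(np.fft.ifft(uh)) for uh in u_hat])
    return modes, np.sort(omega)

# two tones at normalized frequencies 0.05 and 0.25
n = np.arange(1000)
f = np.cos(2 * np.pi * 0.05 * n) + 0.5 * np.cos(2 * np.pi * 0.25 * n)
modes, omega = vmd(f, K=2)
```

For this two-tone input, the center frequencies converge to the tone frequencies and the mode sum closely reconstructs the input.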
The VMD algorithm adaptively decomposes complex signals through iterative frequency-domain operations, yielding multiple effective AM-FM components. This process breaks down non-stationary input signals into K modal components with specific levels of sparsity. During decomposition, the choice of K is critical: an excessively large value leads to over-decomposition, whereas an excessively small value results in under-decomposition. To circumvent these limitations, this study employed the Pearson correlation coefficient to gauge the correlation between the decomposed components and the original signal, thereby determining an appropriate K value. The Pearson coefficient, denoted as γ(X, Y), is defined as follows:

$$\gamma(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \tag{4}$$

where X and Y are two random variables, Cov(X, Y) denotes the covariance of X and Y, and σ_X and σ_Y denote the standard deviations of X and Y, respectively. The value of K is determined using the following procedure:
(1) The initial number of modal components is set as K = 2 (K ≤ 10).
(2) The vibration signal is decomposed using VMD, and the Pearson correlation coefficient between each modal component and the original signal is calculated.
(3) If the correlation coefficient of the newest modal component falls below that of the preceding component, the VMD is over-decomposed; K = K − 1 is taken and the loop ends. Otherwise, K = K + 1, and step (2) is repeated.
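The K-selection loop can be sketched as follows. The decomposition here is a deliberately simple band-split stand-in for VMD (chosen so that the over-decomposition rule fires at K = 3 on the toy signal); `select_K`, `band_split`, and the stopping rule based on the last two coefficients are illustrative assumptions matching step (3) above.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a degenerate (constant) mode."""
    if np.std(x) == 0 or np.std(y) == 0:
        return 0.0
    return float(np.corrcoef(x, y)[0, 1])

def band_split(f, K):
    """Stand-in decomposition: split the spectrum into K equal bands.
    A real implementation would call VMD here."""
    F = np.fft.rfft(f)
    freqs = np.fft.rfftfreq(len(f))
    edges = np.linspace(0, 0.5, K + 1)
    modes = []
    for k in range(K):
        mask = (freqs >= edges[k]) & (freqs < edges[k + 1])
        modes.append(np.fft.irfft(F * mask, n=len(f)))
    return modes

def select_K(f, decompose, k_max=10):
    """Increase K until the newest component's correlation with the
    original signal drops below its predecessor's (over-decomposition)."""
    K = 2
    while K <= k_max:
        r = [pearson(m, f) for m in decompose(f, K)]
        if r[-1] < r[-2]:          # over-decomposed: step back
            return K - 1
        K += 1
    return k_max

n = np.arange(1200)
f = (0.4 * np.cos(2 * np.pi * 0.05 * n)
     + 0.7 * np.cos(2 * np.pi * 0.15 * n)
     + 1.0 * np.cos(2 * np.pi * 0.30 * n))
best_K = select_K(f, band_split)
```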
After selecting the number of modal components K, each modal component was obtained via VMD. Subsequently, the Pearson correlation coefficient between each modal component and the original signal was calculated. The modal components were ranked based on these correlation coefficients, and a valid signal was reconstructed from the most correlated components. The effect of the noise reduction on the roller-fault signal is plotted in Figure 2.
To simulate roller fault data in the presence of substantial background noise, a synthetic signal was generated by combining a strong background noise component with an initial fault impulse signal (Figure 2c). The introduction of the noise component resulted in a cluttered time-domain signal, submerging the time-frequency domain characteristics of the original fault impulse signal (Figure 2d). Consequently, it is challenging to effectively distinguish fault-related time-domain features. To isolate the original vibration signal from the noise-corrupted signal, we employed the VMD signal-to-noise separation algorithm to treat the faulty signal. The time-domain waveform, which was compromised by noise, was subjected to VMD processing, producing a structure similar to that of the original waveform (Figure 2e). The application of the VMD algorithm diminished the noise component within the reconstructed signal spectrum, resulting in clear time-frequency domain features (Figure 2f). Examining the entirety of Figure 2, it is evident that VMD, functioning as an adaptive signal separation method, decomposes the signal into multiple intrinsic modal components and combines those components that contain less noise and are maximally correlated with the original signal. This combination effectively diminishes noise interference in the reconstituted signal while preserving crucial information. Although certain components of the time-domain signal may cancel each other out along the time axis and remain imperceptible, they become distinctly visible in the time-frequency domain. Consequently, conducting wavelet time-frequency analysis on the noise-reduced signal after VMD decomposition and recombination enables the precise extraction of the signal's time-frequency domain information.
In the VMD signal-to-noise separation algorithm, the value of K is particularly important; the initial value of K was set to two. The correlation coefficients obtained for each modal component are listed in Table 1. Table 1 displays the Pearson correlation coefficients corresponding to each intrinsic mode function (IMF) resulting from decomposition. Notably, these coefficients generally increased in each group when K = 2 or 3. However, when K = 4, IMF3 surpassed IMF4; this anomaly arose primarily from over-decomposition of the signal. Conversely, when K was set to 2 or 3, the signal remained incompletely decomposed. Hence, to identify the optimal number of IMFs for the simulated signal, we selected the modal component IMF3, which exhibited the highest Pearson coefficient, to reconstruct a useful signal. The effect of varying K on the signal-denoising performance was assessed through a sensitivity analysis of the number of modal components. A comparative evaluation of the time-frequency domain images of the original noise-added analog signal and the reconstructed signal after VMD decomposition indicated that, with the optimized K value, the VMD method greatly reduced the noise component within the reconstructed signal and enhanced the visibility of fault-related information.

Swin Transformer Network Model Building
CNNs have historically been the primary networks for visual processing applications. However, with the introduction of the transformer structure by Vaswani et al. [27], featuring a global receptive field in a single layer, researchers have begun to explore its application in computer vision tasks. Remarkably, the transformer network surpassed the CNN in terms of performance on large image datasets, establishing itself as a viable alternative in the realm of computer vision. This transition has yielded impressive results across numerous computer vision tasks. Considering the vision transformer (ViT) [28] as an example, the ViT captures an image and divides it into multiple fixed-size patches. Subsequently, these patches are linearly transformed into embedding vectors that are processed by multiple encoders for feature extraction, which is a crucial step in image classification. This approach, which segments an image, excels in capturing global information from the image. However, it exhibits limitations when extracting detailed local image information. In contrast, the Swin Transformer addresses this limitation by enabling the interaction of information across different regions of the entire segmented image. This considerably improves the accuracy of image classification by effectively leveraging both global and local information.
The overall architecture of Swin Transformer is illustrated in Figure 3.
The Swin Transformer network model is organized into four stages, each sharing a similar structural composition. In the initial stage, the RGB three-channel image with dimensions H × W undergoes a Patch Partition layer. This layer divides the image into non-overlapping patches of equal size, each measuring 4 × 4 pixels across three channels. Consequently, after flattening, the feature dimension of each patch is 4 × 4 × 3 = 48, and the input sequence length for the Swin Transformer is H/4 × W/4 = H × W/16 patch tokens. The linear embedding layer, a fully connected layer, projects the tensor of dimensions (H/4 × W/4) × 48 onto an arbitrary dimension C, converting the vector dimension to (H/4 × W/4) × C, and subsequently feeds the patch tokens into the Swin Transformer block. This block structure is illustrated in Figure 4b. In the initial Swin Transformer block, the number of input and output tokens remains unchanged at H/4 × W/4, and the block collaborates with the linear embedding layer in Stage 1. However, as the network depth increases, the number of tokens decreases because of the merging process implemented by the patch-merging layer. The first patch-merging layer and the Swin Transformer block of this stage are combined to form Stage 2.
The patch-merging layer splices each set of patches at an interval of two, reducing the number of patch tokens to 1/4 of the original, i.e., H/8 × W/8, while the dimension of each patch token expands to 4C. This process is illustrated in Figure 5. To optimize computational efficiency, a linear layer performs dimensionality reduction on each 4C-dimensional patch, reducing the output dimension to 2C, thereby lowering the resolution and minimizing the computational load. Following this reduction, the Swin Transformer block performs a feature transformation. Stages 3 and 4 replicate the same process as Stage 2, with each iteration altering the tensor dimensions to establish a hierarchical representation. In this context, N1, N2, N3, and N4 represent the numbers of Swin Transformer blocks at each stage. Typically, N1, N2, and N4 are configured as 2, whereas N3 may vary depending on the specifics of the training data.
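The token bookkeeping of patch partition, linear embedding, and patch merging can be verified with a few lines of array manipulation; the random projection matrices below are stand-ins for the learned embedding and reduction layers.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 224, 224, 96
img = rng.standard_normal((H, W, 3))

# Patch Partition: 4x4x3 patches flattened to 48-dim vectors
patches = img.reshape(H // 4, 4, W // 4, 4, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(H // 4 * (W // 4), 4 * 4 * 3)         # (H/4*W/4, 48)

# Linear embedding: project 48 -> C (random matrix stands in for the learned layer)
x = tokens @ rng.standard_normal((48, C))                       # (3136, 96)

# Patch merging: take every second token in each direction, stack on channels
g = x.reshape(H // 4, W // 4, C)
merged = np.concatenate([g[0::2, 0::2], g[1::2, 0::2],
                         g[0::2, 1::2], g[1::2, 1::2]], axis=-1)   # (H/8, W/8, 4C)
reduced = merged @ rng.standard_normal((4 * C, 2 * C))             # (H/8, W/8, 2C)
```

The shapes confirm the text: 56 × 56 = 3136 tokens of dimension 48 after partition, then 28 × 28 tokens whose dimension goes 4C → 2C after merging.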

The Swin Transformer encoder closely resembles the traditional transformer encoder (Figure 4a), comprising a multi-head self-attention (MSA) layer, layer normalization (LN) modules, and a multilayer perceptron (MLP). The MSA layer is pivotal in extracting information from various dimensions, enhancing feature diversity, preventing overfitting, and ultimately improving the overall model performance. This layer operates by employing different attention heads with distinct Q (query), K (key), and V (value) matrices. These matrices are randomly initialized to project the input vectors into different subspaces. Multiple independent attention heads concurrently process the input, and the resulting vectors are aggregated and mapped to the final output. The mechanism of MSA is as follows [27]:

$$Z_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V), \quad i = 1, \dots, h \tag{5}$$

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{6}$$

$$\text{MSA}(X) = \text{Concat}(Z_1, Z_2, \dots, Z_h)W^O \tag{7}$$

where i is the index of the attention head, ranging from 1 to h; W_i^Q, W_i^K, and W_i^V are the projection matrices of each head; W^O is the output projection matrix; and Z_i is the output matrix of each attention head. X denotes the input vectors, d_k refers to the dimension of the query and key, and d_v refers to the dimension of the value. The MSA separates the inputs into h independent attention heads acting on d_model/h-dimensional vectors and processes the features in parallel across heads. The processed vectors are aggregated into the final output using the Concat function.
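The MSA mechanism described above maps directly onto a few matrix operations. The sketch below uses randomly initialized projection matrices, consistent with the text; the helper names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, h, rng):
    """h parallel heads on d_model/h-dimensional projections, then Concat."""
    n, d_model = X.shape
    d_k = d_model // h
    heads = []
    for _ in range(h):                       # one Z_i per head
        Wq, Wk, Wv = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                      for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))  # scaled dot-product attention
        heads.append(A @ V)
    Wo = rng.standard_normal((h * d_k, d_model)) / np.sqrt(h * d_k)
    return np.concatenate(heads, axis=-1) @ Wo   # aggregate and project

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64))   # 16 tokens, d_model = 64
out = multi_head_self_attention(X, h=8, rng=rng)
```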
As plotted in Figure 4a, the encoder consists of two residual link segments that enhance the flow of image information and contribute to improved training accuracy. An LN module is used within each residual link. The attention mechanism is applied in the first residual link, whereas in the second residual link, the MLP function is applied, as expressed in Equation (8) [27].
$$\text{MLP}(X) = \sigma(XW_1 + b_1)W_2 + b_2 \tag{8}$$

The MLP comprises two linear transformation layers with a non-linear activation function between them. In Equation (8), W_1 and W_2 are the parameter matrices of the two linear layers, and b_1 and b_2 are their bias parameters. The activation function σ is usually the Gaussian error linear unit (GELU).
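Equation (8) and the second residual link combine into one encoder step. The sketch below uses random stand-in weights, and `gelu` follows the common tanh approximation of the Gaussian error linear unit.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian error linear unit
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, W1, b1, W2, b2):
    """MLP(X) = sigma(X W1 + b1) W2 + b2."""
    return gelu(x @ W1 + b1) @ W2 + b2

def encoder_second_half(x, W1, b1, W2, b2):
    """Second residual link of the encoder: x + MLP(LN(x))."""
    return x + mlp(layer_norm(x), W1, b1, W2, b2)

rng = np.random.default_rng(0)
d, hidden = 64, 256                       # the hidden layer typically expands 4x
x = rng.standard_normal((16, d))
W1, b1 = rng.standard_normal((d, hidden)) * 0.02, np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, d)) * 0.02, np.zeros(d)
y = encoder_second_half(x, W1, b1, W2, b2)
```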
The Swin Transformer block, as depicted in Figure 4b, comprises two encoders connected in series. Essentially, the encoder structure within the Swin Transformer block remains unchanged compared with the standard transformer encoder, except that the MSA is substituted with window-based multi-head self-attention (W-MSA). In addition, shifted-window multi-head self-attention (SW-MSA), which is based on the shifted window while retaining the other components unaltered, is introduced. Unlike MSA, W-MSA and SW-MSA concentrate on attentional relationships within local windows, resulting in lower computational complexity; however, all three share the same fundamental principles. Furthermore, the structure and function of the LN layer and the MLP module mirror those of the transformer.
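The window mechanics behind W-MSA and SW-MSA reduce to reshapes and a cyclic roll. The sketch below shows the partition into M × M windows and the shift by M/2 applied between consecutive blocks (the attention and masking themselves are omitted); the function names are illustrative.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M, M, C)          # (num_windows, M, M, C)

def cyclic_shift(x, M):
    """SW-MSA shift: roll the map by (-M//2, -M//2) so the new windows
    straddle the old window boundaries; reversed after attention."""
    return np.roll(x, shift=(-(M // 2), -(M // 2)), axis=(0, 1))

rng = np.random.default_rng(0)
fmap = rng.standard_normal((56, 56, 96))   # Stage-1 resolution for a 224x224 input
M = 7
wins = window_partition(fmap, M)           # 8 x 8 = 64 regular windows
shifted_wins = window_partition(cyclic_shift(fmap, M), M)
```

Because the shift is cyclic, rolling back by (M/2, M/2) after attention restores the original layout exactly, which is why the batch window count stays unchanged.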
As illustrated in Figure 6a, a typical MSA structure involves numerous computations for each element in the feature map. The W-MSA module reduces the computational load by segmenting the feature map into windows of size M × M and separately conducting self-attention computations within each window. In contrast to the regular windows of the W-MSA, the SW-MSA module utilizes shifted windows to expand the receptive field. Figure 6a illustrates the transformation of the four W-MSA windows into nine SW-MSA windows, and the process of shifting the windows is illustrated in Figure 6b. Post-shift, the batch window count remains unchanged; however, certain windows may contain content from sub-windows that were initially non-adjacent. Hence, the masked MSA mechanism is first employed for self-attention calculations; a mask operation then eliminates undesired attention, confining self-attention calculations to the sub-windows that require them. This enables the model to concentrate solely on specific location information while disregarding irrelevant locations. Finally, the shift operation is reversed to retrieve the self-attention outcome for each window. For instance, in Figure 6b, only window 5 requires no splicing, so only the MSA result for this window is preserved directly. W-MSA and SW-MSA are performed in tandem: regular window segmentation is applied in the l-th coding layer, whereas the (l + 1)-th layer utilizes the shifted-window approach. This shifted-window segmentation establishes connections between adjacent non-overlapping windows of the previous layer, enhancing the linkage between local features and bolstering the accuracy of the Swin Transformer in image classification. The specific formulas for the Swin Transformer block are as follows [21]:

$$\hat{z}^l = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1} \tag{9}$$

$$z^l = \text{MLP}(\text{LN}(\hat{z}^l)) + \hat{z}^l \tag{10}$$

$$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^l)) + z^l \tag{11}$$

$$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1} \tag{12}$$

where ẑ^l and z^l denote the outputs of the (S)W-MSA module and the MLP module of the l-th block, respectively.

Gaussian white noise was introduced into the original data samples to simulate noise-affected data and assess the diagnostic
performance of the proposed algorithm on noisy data. The signal-to-noise ratio (SNR) was calculated as follows:

$$R_{sn} = 10 \lg \left( \frac{P_{signal}}{P_{noise}} \right) \tag{13}$$

where R_sn denotes the SNR, P_signal denotes the effective power of the signal, and P_noise denotes the effective power of the noise.
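The SNR definition translates directly into a noise-injection helper for building test samples at a target SNR. This is a sketch of the standard procedure, not necessarily the authors' exact preprocessing; the function names are illustrative.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng):
    """Scale white Gaussian noise so that 10*log10(P_signal/P_noise) = snr_db."""
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = rng.standard_normal(len(signal)) * np.sqrt(p_noise)
    return signal + noise

def measure_snr_db(signal, noisy):
    noise = noisy - signal
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 0.05 * np.arange(4096))
noisy = add_noise_at_snr(clean, snr_db=-12.0, rng=rng)
```

Sweeping `snr_db` from −12 to 8 reproduces the noise conditions used in the experiments below.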
To study the impact of the improved VMD method on fault identification accuracy, the original noise-added signal (strong background noise superimposed on the original signal) and three groups of signals obtained from the improved VMD decomposition were fed into the Swin Transformer for comparison. The three sets of decomposed input signals were as follows: (i) the two modal components with the largest Pearson correlation coefficients, K = 2 (denoted VMDM2); (ii) the three modal components with the largest Pearson correlation coefficients, K = 3 (denoted VMDM3); and (iii) the four modal components with the largest Pearson correlation coefficients, K = 4 (denoted VMDM4). The numbers of Swin Transformer blocks in the four stages of the network used for all four models were two, two, six, and two, respectively. The test results are presented in Figure 7. The signal recognition accuracies achieved by the four models varied with the SNR level. At an SNR of 0 dB, the accuracies were 89.02%, 95.30%, 92.08%, and 69.82%, respectively; these values dropped to 78.85%, 83.60%, 77.65%, and 56.98%, respectively, when the SNR fell to −12 dB. Among the four groups of input signals, the group with three signal components consistently exhibited higher fault recognition accuracy than the other three groups. Notably, when utilizing the three most significant signal components, the accuracy of the model declined more gradually as the noise level increased, suggesting that the recognition accuracy was less susceptible to changes in the SNR. This observation underscores the importance of selecting an appropriate K value during signal decomposition, as variations in K can lead to either under-decomposition or
over-decomposition of the signal, both of which can adversely affect the model accuracy.Thus, selecting the right K value is essential for achieving a high model accuracy.To make the Swin Transformer model more adapted to the VMD method, the number of Swin Transformer coding layers was adjusted to test the accuracy of Swin-T, Swin-S, and Swin-L on the test set separately.Swin-T, Swin-S, and Swin-L differred mainly in terms of the output feature map dimension C and number of Swin transformer blocks N3 in Stage 3. The parameters of each model are listed in Table 3.The training accuracies of the three Swin Transformer models for the classification and recognition of the established validation set are illustrated in Figure 8.An in-depth analysis of the results revealed that Swin-T and Swin-S demonstrated faster convergence and higher accuracy as the iterations progressed.In contrast, the Swin-L model exhibited a considerably lower classification accuracy for the validation set.The accuracy of the Swin-T and Swin-S models quickly rose to over 95% and reached a maximum of 99.6% and 99.3%, respectively, while the Swin-L model reached a maximum of 93.8%.In particular, the accuracy curve exhibited more pronounced fluctuations, indicating sub-optimal convergence.Two primary factors contributed to this phenomenon.First, the Swin-L model was inherently more complex than the Swin-T and Swin-S models because of its larger feature map dimensions (C) and greater number of coding layers.This led to an elevated computational demand for the training process.Second, the size of the validation dataset remained constant throughout training, which could lead to overfitting and con- To make the Swin Transformer model more adapted to the VMD method, the number of Swin Transformer coding layers was adjusted to test the accuracy of Swin-T, Swin-S, and Swin-L on the test set separately.Swin-T, Swin-S, and Swin-L differred mainly in terms of the output feature map dimension C and number of 
Swin transformer blocks N3 in Stage 3. The parameters of each model are listed in Table 3.The training accuracies of the three Swin Transformer models for the classification and recognition of the established validation set are illustrated in Figure 8.An in-depth analysis of the results revealed that Swin-T and Swin-S demonstrated faster convergence and higher accuracy as the iterations progressed.In contrast, the Swin-L model exhibited a considerably lower classification accuracy for the validation set.The accuracy of the Swin-T and Swin-S models quickly rose to over 95% and reached a maximum of 99.6% and 99.3%, respectively, while the Swin-L model reached a maximum of 93.8%.In particular, the accuracy curve exhibited more pronounced fluctuations, indicating sub-optimal convergence.Two primary factors contributed to this phenomenon.First, the Swin-L model was inherently more complex than the Swin-T and Swin-S models because of its larger feature map dimensions (C) and greater number of coding layers.This led to an elevated computational demand for the training process.Second, the size of the validation dataset remained constant throughout training, which could lead to overfitting and consequent fluctuations in the accuracy curves across different training rounds.The sample data covered an SNR range from −12 dB to 8 dB.After decomposing and reconstructing the signal components using VMDM3, they were fed into a Swin Transformer for analysis.The test results are plotted in Figure 9.As the SNR gradually increased, there is a noticeable decrease in the performance of the Swin-L model.Throughout the SNR variations, the diagnostic accuracy curves of the Swin-T and Swin-S models exhibited similar trends, with the Swin-S model consistently achieving a slightly higher accuracy than the Swin-T model.Notably, at an SNR of 0 dB, the Swin-L, Swin-T, and Swin-S models' test accuracies were 89.3%, 95.3%, and 97.02%, respectively, at which time the accuracy of the Swin-S model was 
7.72 and 1.72 percentage points higher than that of the Swin-L and Swin-T models, respectively. At an SNR of 8 dB, the test accuracies of the Swin-L, Swin-T, and Swin-S models were 91.9%, 98.6%, and 99.3%, respectively, at which point the accuracy of the Swin-S model was 7.4 and 0.7 percentage points higher than that of Swin-L and Swin-T, respectively. Examining Figures 8 and 9 together, it is evident that Swin-L exhibited a consistently low and fluctuating accuracy during training, which is attributed to its excessively complex structure and computational demands. In contrast, the Swin-T and Swin-S models outperformed the Swin-L model on the test set owing to their simpler architectures and more manageable computational requirements. Comparing the Swin-T and Swin-S models, the latter increases the number of encoders in Stage 3, resulting in greater model complexity. Notably, the Swin-S model demonstrated a superior test accuracy compared to the Swin-T model. This discrepancy can be attributed to the distinct architectures of the two models, which affect their feature extraction capabilities. With a greater number of coding layers, the Swin-S model excelled in hierarchical feature representation, enabling it to capture information at various scales within an image. However, the Swin-L model experienced a decline in accuracy despite an increase in model complexity. This decline is attributed to the oversized structure of the Swin-L model, resulting in overfitting, in which the model learns numerous noisy or redundant features from the training data that cannot be effectively applied to the test data. Moreover, all three models shared the same dataset, and large-scale models typically require greater computational resources and more refined training strategies for convergence and generalization. Performance degradation can be prevented by allocating additional resources and implementing suitable training strategies for the Swin-L model. In summary, the increased complexity of the Swin-S model, coupled with its higher accuracy, was likely the result of an optimal equilibrium between model size and training strategy, enabling it to capture data patterns more effectively. Conversely, despite its greater complexity, the Swin-L model exhibited lower accuracy owing to challenges such as sub-optimal architectural design, overfitting, and inadequate training resources.
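The VMDM3 strategy referenced above, which reconstructs the signal from the modal components most strongly correlated with the original, can be sketched as follows. This is a minimal illustration: the synthetic components below stand in for the output of an actual VMD decomposition, and the function and variable names are hypothetical.

```python
import numpy as np

def select_top_k_modes(signal, modes, k=3):
    """Rank modal components by Pearson correlation with the original
    signal and reconstruct from the k best-correlated ones."""
    corrs = [abs(np.corrcoef(signal, m)[0, 1]) for m in modes]
    top = np.argsort(corrs)[::-1][:k]              # indices of the k highest
    reconstructed = np.sum([modes[i] for i in top], axis=0)
    return sorted(top.tolist()), reconstructed

# Synthetic stand-in for VMD output: two signal-bearing modes plus
# a noise mode and a weak, unrelated mode.
t = np.linspace(0, 1, 2048)
clean = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)
rng = np.random.default_rng(0)
modes = [np.sin(2 * np.pi * 50 * t),               # strongly correlated
         0.5 * np.sin(2 * np.pi * 120 * t),        # strongly correlated
         0.3 * rng.standard_normal(t.size),        # noise mode
         0.1 * np.sin(2 * np.pi * 300 * t)]        # weak, unrelated mode
noisy = clean + 0.3 * rng.standard_normal(t.size)
kept, recon = select_top_k_modes(noisy, modes, k=3)
```

The two signal-bearing modes always rank highest, so the reconstruction retains the fault-related content while most of the noise energy stays in the discarded components.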
The ViT, similar to the Swin Transformer, is a variant of the transformer model designed for processing two-dimensional data by incorporating internal encoders equipped with attention mechanisms. The primary distinction lies in the image-processing approach. When the ViT receives an image, it passes it through multiple encoders for feature extraction, employing the MSA mechanism to capture features of the entire image. In contrast, the encoding module of the Swin Transformer introduces a hierarchical processing mechanism comprising W-MSA and SW-MSA. W-MSA extracts features from different parts of an image, whereas SW-MSA cascades features across image regions. Consequently, the ViT primarily emphasizes global feature extraction, whereas the Swin Transformer excels in capturing both local and global features.
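To make this distinction concrete, the window partitioning that W-MSA relies on can be sketched in NumPy; the window size M = 7 and the 56 × 56 × 96 feature map are illustrative values.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping M x M windows.
    W-MSA then runs self-attention inside each window independently,
    whereas ViT-style global MSA attends over all H*W tokens at once."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    windows = x.transpose(0, 2, 1, 3, 4).reshape(-1, M * M, C)
    return windows  # (num_windows, M*M, C)

feat = np.arange(56 * 56 * 96, dtype=np.float32).reshape(56, 56, 96)
wins = window_partition(feat, M=7)
# 56 / 7 = 8 windows per side -> 64 windows of 49 tokens each
```

Attention over 49-token windows is far cheaper than attention over all 3136 tokens, which is the source of the Swin Transformer's efficiency on larger inputs.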
To evaluate the fault recognition accuracy of the proposed method in the presence of high background noise, we benchmarked our approach against the VMD+CNN [29] and VMD+ViT [30] networks. To this end, we performed a comparative analysis of fault diagnosis accuracies across a range of SNRs (−12 dB to 8 dB), as illustrated in Figure 10. Notably, for VMD+CNN, the recognition accuracy started at 75.6% at the lowest SNR (−12 dB) and peaked at 95.40% at an SNR of 8 dB. In contrast, VMD+ViT exhibited a recognition accuracy of 79.20% at an SNR of −12 dB, which rose to 97.68% at an SNR of 8 dB. Furthermore, over the same SNR range, the accuracy of the VMD+Swin-S model rose from 85.02% to 99.36%. Thus, among the three methods, VMD+Swin-S showed the strongest noise resilience, followed by VMD+ViT, while VMD+CNN exhibited the lowest robustness. The main characteristics of the three models and their performance in the test are shown in Table 4.
According to Table 4 and Figure 10, this performance difference can be attributed to the inherent features of these models. For example, the input data of the CNN model were one-dimensional, whereas the ViT and Swin Transformer process two-dimensional data; compared with one-dimensional data, two-dimensional data can express richer features, so in terms of feature input the ViT and Swin Transformer hold an advantage over a CNN. In addition, a CNN is primarily designed for extracting local features and struggles to capture distant features in input time-series data. In particular, when dealing with long sequences of sound vibration signal data, CNNs often require layer-by-layer convolution operations, leading to a proliferation of parameters and computations. In contrast, the ViT model incorporates the MSA mechanism, enabling it to capture global features and
discern the varying contributions of different samples to the results. This capability enables the VMD+ViT method to achieve superior fault diagnosis performance. In contrast to the ViT and CNN, the Swin Transformer incorporates a hierarchical processing approach involving window-based W-MSA and SW-MSA. This methodology divides an image into localized regions, applies a self-attention mechanism to each region, and subsequently propagates information through a hierarchical structure. This hierarchical processing enhances the model's ability to capture the local features of the image while effectively processing global information through the synergy of local attention and the hierarchical structure. Consequently, this approach enables the Swin Transformer to comprehensively grasp the intricate features within an image and establish relationships among localized elements, resulting in an improved diagnostic performance. The model's adeptness in balancing local and global information extraction positions the Swin Transformer as a superior choice compared to the ViT and CNNs in the realm of diagnosis. The disadvantage of accounting for both global and local features is a more complex model structure, so the simpler structures of CNNs and the ViT remain more advantageous in terms of computational efficiency.
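The cost trade-off between global and windowed attention can be quantified with the complexity estimates given in the original Swin Transformer paper: global MSA scales quadratically with the number of tokens h·w, while W-MSA is linear in h·w for a fixed window size M.

```python
def msa_flops(h, w, C):
    """Global multi-head self-attention cost over an h x w token map
    with embedding dimension C (Swin Transformer paper, Eq. 1)."""
    return 4 * h * w * C**2 + 2 * (h * w)**2 * C

def wmsa_flops(h, w, C, M):
    """Window-based self-attention cost with window size M
    (Swin Transformer paper, Eq. 2): quadratic only in M."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

# For a Stage 1 token map (56 x 56, C = 96) with 7 x 7 windows,
# windowed attention is roughly an order of magnitude cheaper.
ratio = msa_flops(56, 56, 96) / wmsa_flops(56, 56, 96, 7)
```

When the whole feature map fits in a single window (h = w = M), the two costs coincide, which is why the savings only appear at higher resolutions.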

Application Example Validation
In this paper, an industrial field dataset of faulty belt conveyor roller sounds was used to verify the feasibility of the proposed method in practical applications. The fault types in the test included rusted rollers, abnormal roller vibrations, severe belt cracking, and normal rollers. The rotational speeds applied during the test were 50 r/min, 200 r/min, and 400 r/min. The dataset was divided into training, validation, and test sets at a ratio of 7:2:1 and fed into five models for testing, namely, VMD+Swin-S, VMD+Swin-T, VMD+Swin-L, VMD+ViT, and VMD+CNN. The test results are shown in Figure 11.
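The 7:2:1 split can be sketched as follows; the index-based approach and the fixed shuffling seed are illustrative assumptions, as the study does not specify its splitting code.

```python
import numpy as np

def split_7_2_1(n_samples, seed=42):
    """Shuffle sample indices and split them 7:2:1 into
    training, validation, and test index sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(n_samples * 0.7)
    n_val = int(n_samples * 0.2)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = split_7_2_1(1000)
```

Shuffling before splitting avoids grouping recordings from the same operating condition into a single subset.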

From Figure 11, it can be seen that the VMD+Swin-S model had the highest accuracy among all the models at working speeds of 50 r/min, 200 r/min, and 400 r/min, with 98.67%, 98.32%, and 97.65% accuracy, respectively. The accuracy of the VMD+Swin-T model was close to that of the VMD+Swin-S model, at 98.05%, 96.3%, and 96.22%, respectively. Both models were ahead of the other models under all working conditions. The fault diagnosis performance of the VMD+ViT model was better than that of the VMD+CNN model overall. Among all the models, the VMD+Swin-L model had the lowest accuracy under the three working conditions; its accuracy was only 79.2% at a rotational speed of 400 r/min. According to the analysis shown in Figure 9, this was caused by the model being too complex and suffering from overfitting. Based on the above results, the VMD+Swin-S model maintained high accuracy in practical applications.

Conclusions
In addressing the challenge of diagnosing belt conveyor roller faults in noisy environments, this paper presented a method that utilizes the VMD and Swin-S models for accurate fault diagnosis of roller sound signals. The following conclusions are drawn from the tests conducted on belt conveyor roller sound data: (1) The parameters of the VMD and Swin Transformer models can be optimized for specific application scenarios. The VMD signal decomposition and reconstruction method, retaining the modal components with the three highest Pearson correlation coefficients, combined with the Swin Transformer variant Swin-S, demonstrates superior diagnostic accuracy in noisy environments.

(2) The VMD+Swin Transformer model proved highly effective for fault diagnosis, even in the presence of heavy noise. By adjusting the parameters of the Swin Transformer coding layers, the accuracy of the VMD+Swin-S model reached 99.36% on the roller sound test dataset, highlighting its precision and robustness. The accuracy of the VMD+Swin-T model also reached 98.6%. (3) In the various noise condition tests, the accuracies of the VMD+CNN and VMD+ViT methods were 95.4% and 97.68%, respectively. The VMD+Swin-S model proposed in this paper outperformed both in fault diagnosis accuracy for belt conveyor rollers. In the example application test, the accuracy of the VMD+Swin-S model at the three rotational speeds was 98.67%, 98.32%, and 97.65%, respectively, the highest among the compared models. This approach is well-suited for the precise diagnosis and classification of fault categories in belt conveyor rollers operating in strong noise environments.
In this study, VMD was combined with Swin-S to obtain an optimal model that can recognize roller faults in a strong noise environment. However, the Swin Transformer is structurally complex, and its application to large-scale images may be constrained by computational and storage resources. Therefore, lightweighting the Swin Transformer and optimizing its computational efficiency and parameter count, while preserving testing accuracy, is a promising direction for further research toward more efficient training and diagnosis.

Figure 2. VMD decomposition and reconstruction signal effect. (a) Original time-domain signal waveform. (b) Original signal time-domain characteristics. (c) Time-domain signal waveform after adding noise. (d) Time-frequency domain characteristics after adding noise. (e) Time-domain signal waveform after reconstruction. (f) Time-frequency domain characteristics after reconstruction.


Figure 3. Swin Transformer structure diagram. The Swin Transformer network model is organized into four stages, each sharing a similar structural composition. In the initial stage, the RGB three-channel image with dimensions H × W passes through a Patch Partition layer. This layer divides the image into nonoverlapping patches of equal size, each measuring 4 × 4 and comprising three channels. Consequently, after flattening, the feature dimension is 4 × 4 × 3 = 48. The input sequence length for the Swin Transformer is thus H/4 × W/4 = H × W/16 patch tokens, reflecting the effective input sequence length of the model. The linear embedding layer, also known as a fully connected layer, projects the tensor with dimensions (H/4 × W/4) × 48 onto an arbitrary dimension C. This transformation converts the vector dimension to (H/4 × W/4) × C, and the patch tokens are subsequently fed into the Swin Transformer block. This block structure is illustrated in Figure 4b. In the initial Swin Transformer block, the number of input and output tokens remains unchanged at H/4 × W/4, and it works together with the linear embedding layer in Stage 1. However, as the network depth increases, the number of tokens decreases because of the merging process implemented by the patch-merging layer.
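The Stage 1 tokenization described above can be traced with shapes alone; the 224 × 224 input and C = 96 are illustrative choices, and the random projection matrix stands in for the learned linear embedding.

```python
import numpy as np

# Patch Partition + Linear Embedding (Stage 1), shown with shapes only.
# A 224 x 224 RGB image is cut into 4 x 4 patches, each flattened to
# 4 * 4 * 3 = 48 values, then projected to the embedding dimension C.
H, W, C = 224, 224, 96
image = np.random.rand(H, W, 3).astype(np.float32)

patches = image.reshape(H // 4, 4, W // 4, 4, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, 48)   # (H/4 * W/4, 48)

W_embed = np.random.rand(48, C).astype(np.float32)           # linear embedding
tokens = patches @ W_embed                                   # (H/4 * W/4, C)
```

For a 224 × 224 input this yields 56 × 56 = 3136 patch tokens, matching the H/4 × W/4 sequence length stated in the caption.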

Figure 4. Encoder structures of different Transformer models. (a) Traditional Transformer encoder. (b) Swin Transformer block. The first patch-merging layer and the Swin Transformer block of this stage are combined to form Stage 2. The patch-merging layer splices each set of patches with an interval


Following this reduction, the Swin Transformer block performs a feature transformation. Subsequently, Stages 3 and 4 replicate the same process as Stage 2, with each iteration altering the tensor dimensions to establish a hierarchical representation. In this context, N1, N2, N3, and N4 represent the numbers of Swin Transformer blocks at each stage. Typically, N1, N2, and N4 are configured as 2, whereas N3 may vary depending on the specifics of the training data.
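The patch-merging reduction referenced at the start of this passage (concatenating each 2 × 2 group of neighbouring patches and linearly projecting the result back down) can be sketched as follows; the random projection matrix stands in for the learned linear layer.

```python
import numpy as np

def patch_merging(x):
    """Concatenate each 2 x 2 group of neighbouring patches (C -> 4C),
    then project back to 2C with a linear layer, halving the spatial
    resolution in both dimensions."""
    H, W, C = x.shape
    merged = np.concatenate([x[0::2, 0::2], x[1::2, 0::2],
                             x[0::2, 1::2], x[1::2, 1::2]],
                            axis=-1)                               # (H/2, W/2, 4C)
    W_reduce = np.random.rand(4 * C, 2 * C).astype(np.float32)     # 4C -> 2C
    return merged @ W_reduce

x = np.random.rand(56, 56, 96).astype(np.float32)
y = patch_merging(x)   # (28, 28, 192)
```

Each stage thus halves the token-map resolution while doubling the channel dimension, producing the hierarchical representation described above.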

Figure 5. To optimize the computational efficiency, a linear layer is applied to perform dimensionality reduction on a patch with a dimension of 4C, reducing the output dimension to 2C, thereby lowering the resolution and minimizing the computational load.
where W^O is the output projection matrix, and Z_i refers to the output matrix of each attention head. X and Y denote the input vectors, d_k refers to the dimension of the query and key, and d_v refers to the dimension of the value. The MSA separates the inputs into h independent attention heads, each operating on d_model/h-dimensional vectors, and processes the features in parallel across heads. The processed vectors are aggregated and mapped to the final output, with the aggregation performed by the Concat function.
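A minimal NumPy sketch of this MSA computation follows; the random projection matrices are illustrative stand-ins for learned weights, and the token count and dimensions are arbitrary examples.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, h=4, seed=0):
    """MSA sketch: split d_model into h heads of d_k = d_model / h,
    run scaled dot-product attention per head, then Concat the head
    outputs Z_i and project with the output matrix W_O."""
    n, d_model = X.shape
    d_k = d_model // h                               # per-head dimension
    rng = np.random.default_rng(seed)
    heads = []
    for _ in range(h):
        Wq = rng.standard_normal((d_model, d_k))
        Wk = rng.standard_normal((d_model, d_k))
        Wv = rng.standard_normal((d_model, d_k))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))          # (n, n) attention weights
        heads.append(A @ V)                          # Z_i: (n, d_k)
    W_O = rng.standard_normal((d_model, d_model))    # output projection
    return np.concatenate(heads, axis=-1) @ W_O

X = np.random.rand(49, 96)
out = multi_head_attention(X, h=4)                   # (49, 96)
```

Splitting d_model across heads keeps the total cost of the h parallel attentions comparable to a single full-width attention while letting each head specialize.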

Figure 6. Shifted window procedure. (a) W-MSA in layer l and SW-MSA in layer l + 1. (b) Shift configuration batch calculations.
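The shift between layers l and l + 1 shown in this figure is a cyclic displacement of the feature map by half a window before repartitioning, so that the new windows straddle the boundaries of the previous layer's windows. A sketch of this step (with illustrative sizes, using `np.roll` for the cyclic shift):

```python
import numpy as np

# SW-MSA cyclic shift: roll the token map by M // 2 in both spatial
# dimensions before window partitioning; rolling back undoes it.
M = 7
s = M // 2
x = np.arange(56 * 56).reshape(56, 56)

shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))      # shift for layer l + 1
restored = np.roll(shifted, shift=(s, s), axis=(0, 1))  # reverse cyclic shift
```

Because the shift is cyclic rather than zero-padded, the number of windows stays the same, and the batched attention-mask trick in panel (b) handles the wrapped-around regions.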


Figure 7. Influence of different modal component selections on diagnostic accuracy.


Figure 8. Training accuracy of various Swin Transformer models.


Figure 9. Diagnostic accuracy of various Swin Transformer models.

Figure 10. Comparison of diagnosis accuracy between models.

Figure 11. Comparison of fault identification accuracy of models at different rotational speeds.


Table 1. Correlation coefficient of the "K" value corresponding to each modal component.

Table 3. Parameters for various Swin Transformer models.


Table 4. Comparison of characteristics and test accuracy of three models.
