A Siamese Vision Transformer for Bearing Fault Diagnosis

Fault diagnosis methods based on deep learning have progressed greatly in recent years. However, limited training data and complex work conditions still restrict the application of these intelligent methods. This paper proposes an intelligent bearing fault diagnosis method, i.e., the Siamese Vision Transformer, suited to limited training data and complex work conditions. The Siamese Vision Transformer, combining a Siamese network and a Vision Transformer, is designed to efficiently extract the feature vectors of input samples in a high-level space and complete the classification of the fault. In addition, a new loss function combining the Kullback-Leibler divergence in both directions is proposed to improve the performance of the proposed model. Furthermore, a new training strategy termed random mask is designed to enhance input data diversity. A comparative test is conducted on the Case Western Reserve University bearing dataset and the Paderborn dataset, and our method achieves reasonably high accuracy with limited data and satisfactory generalization capability for cross-domain tasks.


Introduction
Bearings, as core components of rotating machinery, are widely applied in industrial fields. Bearing faults are highly likely to damage the entire mechanical system and threaten the safety of employees [1][2][3][4]. Because fault diagnosis plays a crucial role in the maintenance of mechanical equipment, many fault diagnosis methods have been proposed. Traditional signal-based mechanical fault diagnosis methods commonly require manual feature extraction based on knowledge and prior experience [5]. In recent years, deep learning has made progress in many areas, such as computer vision [6,7], natural language processing [8,9] and defect detection [10,11]. Therefore, a large number of fault diagnosis methods based on deep learning have been developed. Zhao et al. [12] designed a novel intelligent fault diagnosis method for accurately and stably diagnosing rolling bearing faults. Their approach was validated on experimental and practical bearing data. Zhang et al. [13] built a novel neural network that uses raw temporal signals as input. Their method achieved high accuracy under complex working conditions. He et al. [14] proposed a bearing fault diagnosis method based on a sparse auto-encoder with a new weight-assignment strategy. Hu et al. [15] proposed a new method using tensor-aligned invariant subspace learning and convolutional neural networks for cross-domain bearing fault diagnosis. Zhu et al. [16] proposed a new fault diagnosis approach based on principal component analysis and a deep belief network. Time-consuming and unreliable manual feature extraction is gradually being replaced by deep learning methods [5,[17][18][19][20].
However, deep learning-based methods usually require a large amount of data for model training. Collecting a considerable amount of data for every type of failure under each working condition poses a considerable challenge in actual industrial application scenarios. Some studies of mechanical fault diagnosis have been conducted using limited data. In [21], Zhang et al. applied the Siamese network to fault diagnosis and designed a Siamese CNN model that reported good performance with limited training samples. A novel method termed the meta-learning fault diagnosis framework was proposed by Li et al. [22] and performed excellently under complex working conditions. Li et al. [23] designed a deep balanced domain adaptation neural network achieving exciting results using limited labeled training data. Hang et al. [24] used principal component analysis and a two-step clustering algorithm to improve performance on a high-dimensional unbalanced training dataset. A new fault diagnosis approach based on a generative adversarial network (GAN) and a stacked denoising auto-encoder (SDAE) was proposed by Fu et al. [25], with experimental results showing high diagnosis accuracy under various working conditions. The Feature Space Metric-based Meta-learning Model (FSM3) was designed by Wang et al. [26] to address the challenge of limited training samples. Lu et al. [27] proposed a new cross-domain DC series fault detection framework based on lightweight transfer convolutional neural networks. A new support vector data description method based on machine learning was proposed by Duan et al. [28] for limited data. Huang et al. [29] proposed a novel method for bearing fault diagnosis under actual conditions and reported that their model achieved good performance on limited data with noisy labels. Bai et al. [30] proposed a novel method for bearing fault diagnosis using a multi-channel convolution neural network (MCNN) and a multiscale clipping fusion (MSCF) data augmentation algorithm to address the challenge of limited sensor data.
At the same time, conventional learning-based methods usually assume that training data and testing data are independent and identically distributed. However, it is impractical to collect sufficient data with the same distribution as test data coming from complex work conditions. This would require the training data to cover all possible operating conditions: different working loads, speeds, noise and so on. Such strict assumptions hinder the application of intelligent fault diagnosis methods in actual industry. From a realistic perspective, the training data are usually collected from specific operating conditions, different but similar equipment, or software fault simulations, which may cause distributions different from the tested data. Intelligent diagnosis techniques with a strong in-distribution assumption can fail when such differences arise. In recent years, numerous research studies have produced a variety of cross-domain diagnosis methods based on transfer learning or domain adaptation, employing data with inconsistencies from various source domains to break the identically distributed assumption [31,32]. These studies' fundamental principle is to build a diagnostic model that can perform effectively in the target domain using the knowledge of the relevant source domain. Exciting performance enhancements have been made in a variety of cross-domain scenarios, such as various work conditions [33,34] and different equipment [19,35]. Zhang et al. [34] proposed a conditional adversarial domain generalization method aiming to extract domain-invariant features from different source domains and generalize to unseen target domains. Li et al. [34] implemented adversarial domain training to extract generalized features learned from different domains that hold in new working scenarios. Zheng et al. [36] combined prior knowledge and a deep domain generalization network for fault diagnosis.
Although the above methods have achieved exciting results in both research directions, studies that put limited data and domain generalization into a unified framework are rare.
In recent years, the Transformer has achieved great success in natural language processing and computer vision. Ding et al. [37] applied the Transformer to fault diagnosis of rolling bearings and proposed a novel method termed the time-frequency Transformer, which achieved satisfactory performance. Weng et al. [38] designed a one-dimensional Vision Transformer with Multiscale Convolution Fusion (MCF-1DViT) combining CNN and Vision Transformer for bearing fault diagnosis. They reported that their method can significantly improve diagnosis accuracy and anti-noise ability. Tang et al. [39] introduced integrated learning into the Vision Transformer model for bearing fault diagnosis and achieved good results. The exciting performance of these methods shows the great potential of the Transformer in the field of fault diagnosis.
In the current study, we propose a novel fault diagnosis method that improves the model's generalization ability in the face of two challenges for rolling bearings, i.e., limited training data and domain generalization. First, the time-series signal is converted into a time-frequency graph with the short-time Fourier transform (STFT). Second, a Siamese Vision Transformer (SViT) is designed to extract feature vectors efficiently and implement classification tasks. In addition, we design a new loss function, the bidirectional Kullback-Leibler divergence (DKLD), to improve the performance of the proposed model. A new training strategy, i.e., the random mask, is also proposed to reduce the overfitting risk of the model. The remainder of this paper is organized as follows. Section 2 details our method, including the Siamese network, the Vision Transformer, the new loss function (bidirectional KL divergence) and the random mask strategy. Section 3 presents the experiments, results and discussion. Finally, conclusions are drawn in Section 4.

The Framework of the Proposed Method
As shown in Figure 1, the proposed method is a Siamese-based neural network using an improved vision transformer as the backbone. The inputs are a pair of time-frequency graphs obtained from raw vibration signals through STFT. First, the time-frequency graphs are divided into 8 × 8 patches. After that, the patches are fed into the random mask layer, which masks the input patches with a random rate p. Second, the 2D patches are flattened into 1D vectors through linear projection. Then the class token (a trainable vector with the same size as a patch) is concatenated in front of the flattened vectors. At the same time, the positional encoders are added to the vectors. Third, the series of vectors is fed into the transformer encoder constructed with two transformer encoder layers. At the top of the network, the class token outputs are used to calculate the distance between the two input time-frequency graphs. The details of the layers are shown in Table 1.



Data Processing
The short-time Fourier transform (STFT) uses a fixed-length nonzero window function sliding along the time axis to truncate the signal into segments of the same length. The Fourier transform can be used to obtain the local frequency spectra of the segments, assuming that each segment is stationary. A 2D time-frequency graph is obtained by recombining these local frequency spectra along the time axis. The formula is presented in Equation (1):

STFT(τ, ω) = ∫ x(t) g(t − τ) e^(−jωt) dt (1)

where x(t) is the original signal and g(t − τ) is the window function centered at time τ.
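The STFT pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; window length, hop size and the Hann window are assumptions for the example.

```python
import numpy as np

def stft_magnitude(x, win_len=64, hop=32):
    """Slide a fixed-length Hann window along the signal, take the FFT of
    each windowed segment, and stack the local spectra into a 2D
    time-frequency map (Equation (1), discretized)."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # One-sided FFT of each segment -> local frequency spectra
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return spectra.T  # rows: frequency bins, columns: time frames

# Example: 1 kHz sampling rate, 100 Hz tone as a toy vibration signal
t = np.arange(1024) / 1000.0
tf_map = stft_magnitude(np.sin(2 * np.pi * 100 * t))
```

The resulting `tf_map` plays the role of the time-frequency graph that is fed to the network; in practice the paper's graphs are resized so they divide evenly into 8 × 8 patches.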

Siamese Network
The Siamese network algorithm was proposed by Bromley et al. [40,41] for detecting forged signatures in 1994. A typical Siamese network consists of two twin networks with the same structure and parameters. The two networks receive different inputs and are connected by an energy function calculating a metric in a high-level feature space. As shown in Figure 2, tying the weights of the two subnetworks ensures that two highly similar inputs are not mapped onto extremely different positions in the feature space by their respective networks. Besides, the network is symmetrical: whenever two different inputs are presented to the twin network, the top connection layer calculates the same metric as when the inputs are presented to the opposite subnetworks. The Siamese network can make full use of the limited training samples to achieve efficient feature extraction by using same-class or different-class sample pairs as the training samples. As shown in Equation (2), f is the hidden layer of the model. The output layer is a fully connected layer that uses the distance feature vector as input and outputs the probability that two input data belong to the same category. This layer is obtained using Equation (3), where simg is the sigmoid function and FC represents the fully connected layer.
d(x1, x2) = |f(x1) − f(x2)| (2)

p(x1, x2) = simg(FC(d(x1, x2))) (3)

The network is optimized with an Adam optimizer, which adaptively sets the learning rate for each parameter.
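A minimal NumPy sketch of Equations (2) and (3): both branches share the same weights (weight tying), the element-wise distance of the two embeddings is passed through a fully connected layer, and a sigmoid produces the same-class probability. The layer sizes and the tanh stand-in for the backbone f are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, W):
    """Stand-in for the shared hidden layers f: both branches
    use the SAME weights W (weight tying)."""
    return np.tanh(x @ W)

def siamese_prob(x1, x2, W, W_fc, b_fc):
    """Equations (2)-(3): component-wise distance between the two
    embeddings, then FC + sigmoid -> probability of same class."""
    d = np.abs(f(x1, W) - f(x2, W))       # distance feature vector
    logit = d @ W_fc + b_fc               # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))   # sigmoid (simg)

W = rng.normal(size=(16, 8))
W_fc = rng.normal(size=8)
x = rng.normal(size=16)
p_same = siamese_prob(x, x, W, W_fc, 0.0)  # identical inputs -> d = 0
```

With identical inputs the distance vector is zero, so the output sits at the sigmoid midpoint until the bias is trained; swapping the two inputs gives the same probability, reflecting the symmetry discussed above.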


Vision Transformer
A transformer is a neural network model that relies entirely on a self-attention mechanism to model the relationship between input and output [42]. Because of its parallel architecture, which differs from the sequential structure of the traditional recurrent neural network, the transformer can consider global information comprehensively and be trained in parallel. The architecture of the transformer model is depicted in Figure 3 and primarily comprises an encoder, a decoder and a positional embedding layer. To help the transformer address the issue of long-term dependency more effectively, positional embedding is utilized to add the relative positioning information of the input data to the data processed by the embedding layer. The transformer performs well in many time series tasks based on the above advantages. However, due to the computational complexity of the self-attention mechanism, it requires more memory and computational power during training and prediction. Considering the information redundancy between adjacent pixels, the vision transformer (ViT) was proposed in [43] to reduce the computational complexity of the model.
Due to its global information sensing capability, ViT achieves exciting performance in the field of image and vision recognition. The structure of the ViT model consists of a projection of flattened patches, a transformer encoder and a classification head. The input image is first divided into a series of patches. These image patches are then passed through an embedding layer that outputs vectors of a specific length. To preserve the positional relationship of the input image, position embeddings of the same size as the embedded vectors are added to the image patches. The sequence of image patches is passed to the transformer encoder, mainly composed of a multi-head attention layer and an MLP layer. The multi-head attention layer extracts different levels of self-attention information from the input through each head. The output of the class token is fed to the MLP head to give the classification result.


Patch Embedding Layer
The Patch Embedding Layer transforms a conventional visual problem into a seq2seq problem through image segmentation and linear projection. As shown in Equation (4), suppose the input image x ∈ R^(h×w×c), where h, w, c represent the image's height, width and channel, respectively. P(*) is the dividing operation and x_p ∈ R^(N×(p×p×c)) denotes the sequence of the divided image, where N and p represent the number of image patches and the width of a patch, respectively. L(*) is the linear projection and x_p ∈ R^(N×D) denotes the projected vectors, where D represents the dimension of the vector space. Concat(*) is the vector concatenation operation and z ∈ R^((N+1)×D) denotes the input of the transformer encoder, where cls_token is a learnable parameter with the same size as a mapped vector and the positional coders (position_coder) of the image patches are added to the vector space.

x_p = P(x), z = Concat(cls_token, L(x_p)) + position_coder (4)
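The patch embedding pipeline above can be sketched with plain NumPy reshapes. The image size, patch size and projection dimension below are illustrative, and the learnable cls_token and positional coders are zero-initialized placeholders for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_embed(img, p, W_proj, cls_token, pos_coder):
    """Equation (4): split the image into p x p patches (P), flatten and
    linearly project them (L), prepend cls_token (Concat) and add
    the positional coders."""
    h, w, c = img.shape
    # P(*): N = (h/p)*(w/p) patches, each flattened to p*p*c values
    patches = (img.reshape(h // p, p, w // p, p, c)
                  .transpose(0, 2, 1, 3, 4)
                  .reshape(-1, p * p * c))          # x_p in R^{N x (p*p*c)}
    tokens = patches @ W_proj                        # L(*): R^{N x D}
    z = np.concatenate([cls_token[None, :], tokens]) # prepend class token
    return z + pos_coder                             # add positional coders

p, D = 8, 32
img = rng.normal(size=(64, 64, 1))                   # 8 x 8 grid = 64 patches
W_proj = rng.normal(size=(p * p * 1, D))
z = patch_embed(img, p, W_proj, np.zeros(D), np.zeros((65, D)))
```

The output has N + 1 = 65 token vectors of dimension D, matching the (N + 1) × D shape fed to the transformer encoder.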

Transformer Encoder
A transformer encoder is composed of multiple identical stacked module layers. Each layer mainly contains two sub-layers, i.e., the multi-head self-attention layer and the MLP feedforward layer. To improve the stability of the model in training, each sub-layer internally uses a residual connection and layer normalization.

• MLP layer
The structure of the MLP is shown in Figure 4, including a fully connected layer, the GELU activation function and dropout. In ViT, the Gaussian error linear unit (GELU) activation function is used in the feedforward layer. The GELU activation function is expressed as Equation (5):

GELU(x) = x · Φ(x) (5)

where Φ(x) is the cumulative distribution function of the standard normal distribution.
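For reference, GELU is commonly computed with the tanh approximation shown below (an approximation to x·Φ(x), not necessarily the exact variant used in the paper's implementation):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU(x) = x * Phi(x), where Phi is the
    standard normal CDF. Smoothly gates small/negative inputs toward 0
    while passing large positive inputs through almost unchanged."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))
```

Unlike ReLU, GELU is smooth and non-monotonic near zero, which is often credited with more stable transformer training.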



• Multiheaded self-attention layer
The self-attention mechanism enables the network model to extract globally valid features, but a single-head attention mechanism can only learn the feature representation of a single representation space. To comprehensively extract long-range features from the whole image, the multi-head self-attention mechanism is used to combine features from different feature subspaces.
The calculation formula of self-attention is written as Equations (6) and (7):

Attention(Q, K, V) = softmax(QK^T / √d)V (6)

Q = XW_Q, K = XW_K, V = XW_V (7)

where Q, K and V are the query matrix, key matrix and value matrix, respectively. These matrices are calculated by multiplying the feature matrix X with the learnable matrices W; d denotes the dimension of Q, K and V. The multi-head self-attention mechanism uses multiple self-attention heads to learn features from different representation subspaces and finally integrates these subspace features through linear mapping. The multi-head self-attention mechanism can be expressed as Equation (8):

MultiHead(Q, K, V) = Concat(head_1, …, head_h)W (8)

where Concat(*) is the concatenation operation and W denotes the weight matrix of the projection.
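The scaled dot-product attention and its multi-head combination can be sketched directly from Equations (6)-(8). Sequence length, model dimension and head count below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Equation (6): scaled dot-product attention."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """Equation (8): project X into per-head Q/K/V (Equation (7)),
    run attention per head, concatenate and project with Wo."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
n, d_model, h, d_head = 5, 16, 4, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
Wo = rng.normal(size=(h * d_head, d_model))
out = multi_head(X, Wq, Wk, Wv, Wo)
```

Each row of the softmax output is a probability distribution over the input tokens, which is what lets every position attend to the whole sequence.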

MLP Head
The MLP header layer consists of a fully connected layer and an activation function for the classification task of diagnosing faults.In this study, the class token vector processed by the transformer encoder is fed to the MLP header and the probability value of each fault category is obtained through the SoftMax function.The final fault category is obtained according to the maximum probability value.
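The MLP head described above reduces to a fully connected layer followed by SoftMax and an argmax over the fault categories. The class-token dimension and the 10-category output below are illustrative (10 matches the CWRU setup later in the paper).

```python
import numpy as np

def mlp_head(cls_vec, W, b):
    """FC layer on the class-token vector, then SoftMax over fault
    categories; the predicted category is the maximum probability."""
    logits = cls_vec @ W + b
    e = np.exp(logits - logits.max())     # stable SoftMax
    probs = e / e.sum()
    return probs, int(np.argmax(probs))

rng = np.random.default_rng(0)
probs, pred = mlp_head(rng.normal(size=32),
                       rng.normal(size=(32, 10)), np.zeros(10))
```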

Bidirectional KL Divergence
Kullback-Leibler (KL) divergence measures the similarity of a probability distribution to a reference probability distribution [44,45]. A KL divergence of 0 indicates that the two distributions are the same. For discrete probability distributions P and Q defined in the same probability space, the KL divergence [46] from Q to P is defined as Equation (9):

D_KL(P‖Q) = Σ_i P_i log(P_i / Q_i) (9)

By contrast, the KL divergence from P to Q is defined as Equation (10):

D_KL(Q‖P) = Σ_i Q_i log(Q_i / P_i) (10)

Equations (9) and (10) clearly show that the KL divergence is asymmetric. As shown in Equation (9), in the KL divergence from Q to P, when P_i = 0, the term P_i log(P_i / Q_i) = 0 regardless of the value of Q_i. In the two-classification problem, the loss function can therefore only act on one term. To fully measure the difference between the label and the predicted value, we design a new loss function, called the bidirectional KL divergence (DKLD), as shown in Equation (11), where P_i represents the label value and Q_i is the predicted probability of the model:

DKLD(P, Q) = Σ_i [P_i log(P_i / Q_i) + Q_i log(Q_i / P_i)] (11)
The iteration of gradient descent updates the parameters as shown in Equation (12):

W ← W − α ∂L/∂W, b ← b − α ∂L/∂b (12)

where W is the model's weight, b is the bias and α is the learning rate. P is a constant and the gradient can be calculated as Equation (13). Compared with the gradient of the cross-entropy loss function, shown in Equation (14), the gradient of DKLD has an additional term. This term contributes to the gradient regardless of whether P approaches 0 or 1. We expect that this characteristic of DKLD can help to improve the performance of the model in cases with limited training samples. To prevent calculation errors, we limit the value of P to [0.001, 1] during calculations.
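A small NumPy sketch of the DKLD loss of Equation (11), including the clipping to [0.001, 1] described above. The clipping value is taken from the text; applying it to both P and Q is an assumption of this sketch.

```python
import numpy as np

def dkld(P, Q, eps=1e-3):
    """Bidirectional KL divergence (Equation (11)):
    DKLD = KL(P||Q) + KL(Q||P), with values clipped to [0.001, 1]
    to avoid log(0), as described in the text."""
    P = np.clip(P, eps, 1.0)
    Q = np.clip(Q, eps, 1.0)
    return float(np.sum(P * np.log(P / Q) + Q * np.log(Q / P)))

same = dkld(np.array([0.5, 0.5]), np.array([0.5, 0.5]))  # identical -> 0
```

Note that, unlike either one-directional KL divergence, DKLD is symmetric in its two arguments, so it penalizes label-prediction mismatch regardless of which distribution carries the zero entries.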
The comparison between DKLD and cross-entropy is presented in Table 2.

Random Mask Strategy
Similar to dropout, the mask strategy randomly deactivates units with probability p in each forward propagation during training. Unlike dropout, which operates on individual neuron units, the mask strategy has a larger operation granularity; the operating object in this paper is a patch. Deactivated neurons in low-level layers affect high-level neurons, so applying the mask strategy directly to the input layer can achieve the effect of data augmentation and ensemble learning at the same time. Applying the mask to the input amounts to feeding the network an input image cropped randomly and irregularly.
Masking patches with a specific distribution was not enough. Motivated by [47,48], we randomly change the mask rate on each forward propagation to obtain a new input image with uncertain features. In this paper, the mask rate p ∼ Uniform(0.5, 0.9). The visualization of the random mask strategy is illustrated in Figure 4.
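The random mask strategy above can be sketched as follows: a fresh mask rate p is drawn from Uniform(0.5, 0.9) on every forward pass, and whole patches (not individual neurons) are zeroed out. The patch count and patch vector length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(patches):
    """Draw the mask rate p ~ Uniform(0.5, 0.9) anew on each forward
    pass, then zero out roughly that fraction of patches. Granularity
    is the whole patch, unlike neuron-level dropout."""
    p = rng.uniform(0.5, 0.9)
    keep = rng.random(len(patches)) >= p   # True = patch survives
    return patches * keep[:, None], p

patches = np.ones((64, 192))               # 64 flattened input patches
masked, p = random_mask(patches)
```

Because p is resampled every pass, the same training image yields a different irregular crop each time it is seen, which is the data-augmentation effect described above.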

Experimental Setup
We set up a series of experiments to verify the prediction accuracy and generalization ability of SViT on the Case Western Reserve University (CWRU) bearing datasets [49,50] and the Paderborn bearing dataset [51]. The test platform runs Ubuntu 18.04, Python 3.7 and PyTorch with an Intel® Core™ i7 CPU and an Nvidia GTX 3060 GPU.

Comparison Models and Evaluation Metric
As shown in Table 3, the proposed model was compared with WDCNN, the Siamese CNN, PSDAN, FSM3, DeIN and HCAE. WDCNN, whose first layer uses a wide convolution kernel, was proposed in [24]. The Siamese CNN was designed by Zhang et al. [29]. PSDAN, FSM3, DeIN and HCAE were proposed in [26,[52][53][54], respectively. The details of the comparison methods are shown in Table 4. The SViT model was proposed by our team and the parameters of the comparison models are listed in Table 1. Accuracy, precision, recall and F1 score are used to evaluate the performance of the proposed model. They can be obtained by the following equations:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

where TP, FP, TN, FN represent true positive, false positive, true negative and false negative, respectively.
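The four evaluation metrics follow directly from the confusion counts; the counts in the example call are made up for illustration.

```python
def metrics(TP, FP, TN, FN):
    """Accuracy, precision, recall and F1 score from the confusion
    counts: true/false positives and true/false negatives."""
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(TP=90, FP=10, TN=85, FN=15)
```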

Case Study 1: CWRU Bearing Datasets
To verify the performance of the proposed method, the 12k drive-end bearing fault data in the CWRU bearing datasets are selected as the original experimental data. Data are collected from vibration signals, as shown in Figure 5. Table 5 shows four types of faults in these data: normal, ball fault, inner race fault and outer race fault. Each fault has three subtypes: 0.007 inches, 0.014 inches and 0.021 inches. Thus, we have 10 different fault categories.

Evaluating the Effectiveness of DKLD
We set up a series of comparative experiments by randomly selecting 60, 90, 120, 200, 300, 600, 900, 1500, 6000 and 19,800 samples from datasets A, B and C. Each experiment uses 60% of the samples as the training set and the remaining samples as the validation set. To verify the proposed DKLD loss function's effectiveness, we use DKLD and cross-entropy to train our model separately with different sample sizes and then compare the test results. As shown in Figure 7, in cases with a small number of training samples, DKLD significantly improves the model's performance compared with cross-entropy. For example, when the sample size is 60 and 90, the accuracy rates of using DKLD are 1.33% and 0.56% higher than those of using cross-entropy, respectively. When the training sample size is increased to 120 and above, the performance of the two loss functions is exceptionally close, reaching more than 99%.
To improve the understanding of the effect of DKLD, we use t-distributed stochastic neighbor embedding (t-SNE) to visualize the output of the last hidden fully connected layer of the model trained with DKLD and cross-entropy at the 60-sample size. As shown in Figure 8a,b, the features of DKLD are more divisible than those of cross-entropy, particularly in categories 1 and 3. Figure 8c,d shows the confusion matrices of the results.

The Effect of the Number of Transformer Encoder Layers
To observe the effect of the number of transformer encoder layers, we tested the performance of the proposed model with different numbers of transformer encoder layers in the cross-domain experiment from dataset C to dataset A (the most difficult cross-domain task [21]). As shown in Figure 9, the proposed model achieved the best performance with two transformer encoder layers, so SViT with two transformer encoder layers is implemented in follow-up experiments.



Ablation Experiments
To verify the effectiveness of the Random Mask strategy and the Siamese network structure, we set up ablation experiments on cross-domain tasks with 600 training samples. The Random mask strategy and the Siamese network structure are removed from the proposed method in turn. When the Siamese network structure is removed, the distance layer is replaced with a fully connected classifier.
As shown in Table 7, where (w/o) means without, the Random mask strategy and the Siamese network effectively improve the robustness of the model in cross-domain tasks. Implementing the same experimental setup as above, we evaluate the performance of the various methods using different numbers of training samples. We repeat the sample selection process five times for each sample size to generate different training sets, reducing the bias of randomly selecting a small training set. For each random training sample set, we repeat the algorithm training four times to address the randomness of the algorithm. Each series of experiments is repeated 20 times. We use one-shot testing in the Siamese CNN and our method.
Figure 10 clearly shows that as the amount of training samples increases, the accuracy of all methods also increases while their standard deviation decreases. This shows the sensitivity of deep learning-based intelligent fault diagnosis methods to the amount of training data. Subsequently, we check whether the proposed SViT model's accuracy is better than those of the other models in cases with limited training samples (e.g., 60 and 90). In both cases, our model performs better than the other models. Simultaneously, the experimental results indicate that when the training sample size is increased to 900 and above, all the algorithms' performance becomes increasingly similar and their accuracy rates are all higher than 97%. This comparison proves that the proposed SViT exhibits significant advantages over the comparison algorithms in cases with limited training samples. Even in the case of 60 training samples, the proposed algorithm's accuracy rate still reaches 97.56%.

Performance in Noisy Environment
In this experiment, we evaluate the performance of the proposed model in a noisy environment. The model is trained with raw data and then tested on samples corrupted with additive white Gaussian noise at different signal-to-noise ratios (SNRs). The SNR is the ratio of signal power to noise power, usually expressed in decibels (dB): SNR_dB = 10 log10(P_signal / P_noise), where P_signal denotes the power of the signal and P_noise the power of the noise. The SNR range is from −4 dB to 10 dB; the lower the SNR value, the stronger the noise.
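The noise-injection procedure implied by the definition above can be sketched as follows; the function name `add_awgn` and the fixed random seed are illustrative choices, not from the paper:

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise so the result has the given SNR in dB.

    From SNR_dB = 10*log10(P_signal / P_noise), the target noise power
    is P_signal / 10**(SNR_dB / 10).
    """
    rng = rng or np.random.default_rng(0)
    p_signal = np.mean(signal ** 2)                 # average signal power
    p_noise = p_signal / (10 ** (snr_db / 10))      # required noise power
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise

t = np.linspace(0, 1, 2048, endpoint=False)
clean = np.sin(2 * np.pi * 50 * t)
noisy = add_awgn(clean, snr_db=-4)  # strong interference, as in Figure 11a
```

Each test sample is corrupted this way at the chosen SNR before being fed to the trained model, while training always uses the raw (clean) data.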
In Figure 11, we examine the effect of training sample size on the test accuracy of each model in different noisy environments. In Figure 11a,b, SNR = −4 and 0 represent strong noise interference; by contrast, in Figure 11c,d, SNR = 4 and 8 represent weak noise interference. The anti-noise capability of the proposed model is better than those of the other models, and the advantage is more apparent under intense noise, as shown in Figure 11a,b. Since the proposed method is not specifically designed for noise robustness, and following the report in [21], we speculate that this anti-noise ability derives from the twin network structure.

Domain Generalization Experiments
To further verify the domain generalization ability of the proposed model, we conduct a cross-domain experiment in which all models are trained in the source domain and tested in the target domain. Note that the model never touches the target-domain data during training. The experiment was repeated five times for each task. The classification accuracies are shown in Table 8, in which A-B refers to training on dataset A and testing on dataset B. The proposed SViT achieved the best performance among all the methods in all scenarios. Specifically, SViT achieved an accuracy of 92.24% on the C-A task (the most difficult task), which is 13.4%, 31.88%, 12.86%, 2.8%, 12.56% and 11.57% higher than WDCNN, Siamese CNN, PSADAN, FSM3, DeIN and HCAE, respectively. This shows that the proposed method generalizes across domains better than the comparison methods. Tables 9-11 report the precision, recall and F1-score comparisons for cross-domain task C-A with 6000 training samples; the results show that the proposed SViT outperformed all compared approaches. In this experiment, the datasets contain vibration signals obtained from healthy, artificially damaged and naturally damaged bearings. The selected dataset filenames are shown in Table 13, and the details of the selected datasets are given in Table 14.

Results and Analysis
With the same implementation, Figure 14 shows the cross-domain task accuracy of the comparison approaches and our method as the number of training samples increases. The results show that our method outperformed the state-of-the-art methods in all scenarios. Table 15 reports the cross-domain task accuracy of the different methods with 1800 training samples; the proposed method outperformed all comparative methods by 1.80-4.29% on average. Tables 16-18 compare the methods in precision, recall and F1 score on the cross-domain task E-D with 1800 training samples. These results also show that our method is superior to the alternatives.

Conclusions
In this work, an intelligent bearing fault diagnosis method, SViT, has been proposed to face the challenges of limited data and domain generalization. We have designed a Siamese Vision Transformer (SViT) to extract features efficiently. In addition, a loss function called DKLD has been proposed to improve the model's prediction accuracy and generalization capability. Furthermore, a novel random mask training strategy has been combined with the SViT to reduce the overfitting risk and further improve generalization. The experimental results show that our method achieves better generalization than the state-of-the-art approaches on limited-data and cross-domain tasks.
However, the proposed method still has some restrictions. For instance, it is limited to cross-domain tasks on the same equipment. In addition, in the prediction stage of SViT, a small amount of supporting data in the target domain is still required, which limits the application scenarios of the proposed method.

The main contributions of this work are as follows: (1) The proposed SViT, based on a Siamese network and ViT, obtains satisfactory prediction accuracy on limited-data and domain generalization tasks. (2) A new loss function is obtained by combining the KL divergence in both directions to improve the proposed model's performance. (3) A novel training strategy, random mask, focusing on increasing the diversity of the input data distribution, is designed to enhance the generalization ability of the model. (4) The experimental results show that the proposed method achieves high accuracy rates and has satisfactory anti-noise and domain generalization ability.
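The random mask strategy named above can be sketched as follows. The paper describes it only as increasing input-data diversity, so the patch-level granularity, the 25% mask ratio, and zero-filling of masked patches are all assumptions of this sketch:

```python
import numpy as np

def random_mask(patches, mask_ratio=0.25, rng=None):
    """Zero out a random subset of input patches during training.

    A sketch of the random-mask idea: each call masks a different
    random subset, so the model sees a more diverse input distribution.

    patches : array of shape (N, D) -- N patch vectors of dimension D
    """
    rng = rng or np.random.default_rng()
    n = patches.shape[0]
    k = int(n * mask_ratio)                      # number of patches to mask
    idx = rng.choice(n, size=k, replace=False)   # patches chosen uniformly
    out = patches.copy()
    out[idx] = 0.0                               # masked patches are zeroed
    return out
```

Applied per training step, this acts as a structured form of input dropout; at test time the inputs are left unmasked.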

Figure 1 .
Figure 1.The overall framework of the proposed method.

Figure 3 .
Figure 3. The architecture of the transformer. x_p is the patch sequence of the divided image, where N and P represent the number of image patches and the width of a patch, respectively. L(·) is the linear projection, and x'_p ∈ R^(N×D) denotes the projected vectors, where D represents the dimension of the vector space.

Figure 4 .
Figure 4.The Random mask training strategy.


Figure 5 .
Figure 5. CWRU bearing fault diagnosis test platform. We use half of the vibration signals to generate training samples and the remaining signals to generate the test set. As shown in Figure 6, the training samples are generated by a sliding window of 2048 points with an overlapping step of 80 points. The test samples pass through sliding windows of the same size, but are generated without overlap. As shown in Table 5, the dataset includes 19,800 training samples and 750 test samples. Finally, the training and test samples of the proposed model are obtained through the STFT.
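The sampling scheme above can be sketched as follows; we read "80 points of overlapping steps" as a stride of 80 points between consecutive windows, and the function name `sliding_windows` is our own:

```python
import numpy as np

def sliding_windows(signal, width=2048, step=80):
    """Cut a 1-D vibration signal into fixed-width windows.

    step < width  -> overlapping windows (training samples)
    step == width -> non-overlapping windows (test samples)
    """
    n = (len(signal) - width) // step + 1
    return np.stack([signal[i * step : i * step + width] for i in range(n)])

sig = np.arange(12_000, dtype=float)        # stand-in for a vibration record
train = sliding_windows(sig, step=80)       # overlapping windows for training
test = sliding_windows(sig, step=2048)      # non-overlapping windows for testing
```

Each resulting window would then be transformed with the STFT (e.g., scipy.signal.stft) to produce the time-frequency images fed to the model.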

Figure 6 .
Figure 6. Generating samples with overlap.
3.3.1. Evaluating the Effectiveness of DKLD
We set up a series of comparative experiments by randomly selecting 60, 90, 120, 200, 300, 600, 900, 1500, 6000 and 19,800 samples from datasets A, B and C. Each experiment uses 60% of the samples as the training set and the remaining samples as the validation set. To verify the effectiveness of the proposed DKLD loss function, we train our model separately with DKLD and with cross-entropy at different sample sizes and then compare the test results. As shown in Figure 7, with a small number of training samples, DKLD significantly improves the model's performance compared with cross-entropy. For example, when the sample size is 60 and 90, the accuracy rates with DKLD are 1.33% and 0.56% higher, respectively, than with cross-entropy. When the training sample size is increased to 120 and above, the performance of the two loss functions is exceptionally close, reaching more than 99%. To better understand the effect of DKLD, we use t-distributed stochastic neighbor embedding (t-SNE) to visualize the output of the last hidden fully connected layer of the model trained with DKLD and with cross-entropy at a sample size of 60. As shown in Figure 8a,b, the features learned with DKLD are more separable than those learned with cross-entropy, particularly for categories 1 and 3. Figure 8c,d shows the confusion matrices of the results.
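The core of DKLD, combining the KL divergence in both directions, can be sketched as follows. The paper's exact weighting of the two directions is not restated here, so this sketch simply sums them; the smoothed (non-hard) target in the example is also an assumption, needed to keep both divergence directions finite:

```python
import numpy as np

def dkld_loss(p, q, eps=1e-12):
    """Symmetric (both-direction) KL divergence between two discrete
    distributions: KL(p||q) + KL(q||p).

    p, q : 1-D probability vectors over the fault classes.
    eps clips probabilities away from zero so the logs stay finite.
    """
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

# Smoothed one-hot target keeps both KL directions finite.
target = np.array([0.94, 0.02, 0.02, 0.02])
pred = np.array([0.70, 0.10, 0.10, 0.10])
loss = dkld_loss(pred, target)   # > 0; equals 0 only when pred == target
```

Unlike plain cross-entropy, which corresponds to a single KL direction against the target, the symmetric form also penalizes probability mass the prediction places where the target has (almost) none.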

Figure 7 .
Figure 7. Results of proposed model training with different loss functions.

Figure 8 .
Figure 8. t-SNE feature visualization and confusion matrices of the model trained with DKLD and with cross-entropy.

Figure 9 .
Figure 9.The accuracy of the proposed model with different numbers of transformer encoders.

Figure 10 .
Figure 10.Diagnosis results of the proposed method compared with those of the comparison models.


Figure 12 .
Figure 12.Feature visualization via t-SNE in cross-domain tasks.

3.4. Case Study 2: Paderborn Dataset
3.4.1. Data Description
As shown in Figure 13, the Paderborn test rig consists of five modules: (1) electric motor, (2) torque-measurement shaft, (3) rolling bearing test module, (4) flywheel and (5) load motor. Bearings are installed in the test module to collect experimental data. The fault types of the bearings include artificial and real damage.

Figure 13 .
Figure 13. Test rig of the Paderborn bearing dataset. Three work conditions are selected to obtain different domain datasets. In dataset D, the test platform runs at n = 1500 rpm with a load torque of M = 0.7 Nm and a radial force on the bearing of F = 1000 N. In dataset E, the load torque changes to M = 0.1 Nm. In dataset F, the radial force changes to F = 400 N. The details of the three datasets are shown in Table 12.

Figure 14 .
Figure 14.The mean accuracy of cross-domain task with the different number of training samples on the Paderborn dataset.(a) D-E; (b) D-F; (c) E-D; (d) E-F; (e) F-D; (f) F-E.


Table 1 .
Details of the proposed model.

Table 3 .
The comparison methods.

Table 4 .
Details of the comparison methods.

Table 5 .
Description of CWRU dataset.

Table 9 .
Precision (%) comparison for cross-domain task C-A with 6000 training samples on CWRU.

Table 10 .
Recall (%) comparison for cross-domain task C-A with 6000 training samples on CWRU.

Table 11 .
F1 score (%) comparison for cross-domain task C-A with 6000 training samples on CWRU.

Table 12 .
Working conditions of test bearing on Paderborn dataset.

Table 13 .
Data sets used for experiments.


Table 14 .
Detail of datasets on Paderborn.



Table 15 .
Mean classification accuracy (%) with 1800 samples on the Paderborn dataset.

Table 16 .
Precision (%) comparison for cross-domain task E-D with 1800 training samples per class on the Paderborn dataset.

Table 17 .
Recall (%) comparison for cross-domain task E-D with 1800 training samples per class on the Paderborn dataset.

Table 18 .
F1 score (%) comparison for cross-domain task E-D with 1800 training samples per class on the Paderborn dataset.