A Novel Bearing Fault Diagnosis Method Based on Few-Shot Transfer Learning across Different Datasets

At present, the success of most intelligent fault diagnosis methods depends heavily on large datasets of artificially simulated faults (ASF). Such methods have not been widely used in practice because it is often costly to obtain a large number of fault samples from real machines. Fortunately, various faults can be easily simulated in the laboratory, and these simulated faults contain a great deal of fault diagnosis knowledge. In this study, based on a Siamese network framework, we propose a bearing fault diagnosis method based on few-shot transfer learning across different datasets (cross-machine), which uses the knowledge of ASF to diagnose bearings with natural faults (NF). First, the model learns a good feature encoder in the source domain; it then defines a fault support set for comparison; finally, it adjusts the support set with a very small number of target domain samples to improve its fault diagnosis performance. We carried out experimental verification from several aspects on the ASF and NF datasets provided by Case Western Reserve University (CWRU) and Paderborn University (PU). The results show that the proposed method can learn diagnostic knowledge from different ASF datasets and sample numbers, and can effectively use this knowledge to accurately identify the health state of NF bearings, showing strong generalization and robustness. Our method does not need a second training stage, which may be more convenient in some practical applications. Finally, we also discuss the possible limitations of this method.


Introduction
Bearings are indispensable parts of much important machinery and equipment, and their failure may lead to serious economic losses and casualties [1]. Therefore, it is essential to obtain the state of the bearing quickly and accurately. In recent years, machine learning has been applied to intelligent fault diagnosis of bearings because of its powerful capabilities. Many well-known machine learning methods, such as the support vector machine (SVM) [2], deep Boltzmann machine (DBM) [3], convolutional neural network (CNN) [4] and generative adversarial network (GAN) [5], have achieved excellent results. The success of most studies, however, depends heavily on a large amount of artificially simulated fault (ASF) data and rests on two conditions: (1) there is a large amount of labeled data with fault information; and (2) the training data and testing data come from the same probability distribution. However, for a variety of reasons [6], it is impractical to obtain a large amount of actual fault data in the real world, so the first condition cannot be met; the second condition cannot be satisfied because of the great difference in feature distribution between ASF and natural faults (NF). Therefore, many research results are not applicable to the working environment of real machines and cannot be widely applied in industrial production.
Recently, some researchers have tried to expand the amount of data by means of data over-sampling [7,8] and data generation [9] to address the problem of limited fault data. It has also been shown that collecting labeled fault data from one machine for training and then testing on another machine is effective.
To sum up, although few-shot learning has achieved some success in fault diagnosis with limited samples, these results are based on standard ASF datasets. Because real faults differ greatly from simulated faults, such methods cannot be directly applied to real industrial machines. In current research on transfer learning, whether between different working conditions or between ASF and NF, the source domain and target domain of most experiments come from the same dataset (the source and target domain data are collected on the same test bench or machine) and follow the same feature distribution. This ignores a major problem: applying such a method to a real machine requires a large amount of appropriate source domain data from that same machine, which is not feasible. Therefore, some researchers try to obtain a large amount of fault data from a machine on which data collection is convenient and to extract diagnostic knowledge from it to identify the health status of another machine. We believe that this is a feasible way to address the problem that a large amount of fault data cannot be obtained in real industrial production, because it is relatively easy to obtain ASF data in the laboratory, and these data contain diagnostic knowledge relevant to real machine bearings.
In this paper, inspired by the fine-tuning-based method, we propose a bearing fault diagnosis method based on few-shot transfer learning across different datasets (cross-machine). Our model is based on the framework of a Siamese network and is capable of few-shot learning. First, the model is trained with the ASF data to learn the available diagnostic knowledge. Then, a fault support set for comparison is defined, and it is assumed that a very small number of NF samples can be obtained. These few NF samples are either added directly to the support set or used to replace its original samples, which improves the generalization ability of the model. Finally, the knowledge of ASF is used to effectively identify the health state of the new machine bearings. The main innovations and contributions of this paper are as follows:
(1) Because most current intelligent fault diagnosis methods cannot be directly applied to industry, a few-shot transfer learning method across different datasets is proposed, which can use the diagnostic knowledge learned from ASF data to effectively identify the health state of the new machine bearings.
(2) For the first time, a very small number of target domain samples are used to replace the original samples of the support set in fault diagnosis. This improves the generalization ability of the model and gives very high stability and accuracy even across different datasets (ASF-NF) with great differences in feature space distribution.
(3) Several experiments are designed to compare and verify many aspects of the proposed method, and they achieve the expected results. Our method does not need secondary training, which makes it more convenient.
The structure of this paper is as follows: Section 2 introduces the theoretical background of the proposed method. Section 3 introduces the proposed method and our model. Section 4 presents experiments and analysis from different aspects. Section 5 gives the main conclusions.

Few-Shot Learning Strategy
When human beings encounter a new kind of object, they may need only a few instances to learn to identify it accurately. Few-shot learning was proposed to emulate this human ability. The general strategy of few-shot learning based on a Siamese network is shown in Figure 1. Unlike the general deep learning strategy, the input during training is a pair of samples $(x_1, x_2)$ from the same or different classes, so only the sample pairs need to be labeled as same-class or different-class. The output is the probability of similarity between the sample pair $(x_1, x_2)$. There are two main testing strategies: one-shot k-way and N-shot k-way. One-shot k-way means that the support set contains k categories with only one instance per class; N-shot k-way means that the support set contains k categories with N instances per class. In the one-shot k-way test, a test sample $\hat{x}$ that needs to be classified and a support set are given; the support set is defined in Equation (1). The model then judges the similarity between $\hat{x}$ and each of the samples $(x_1, x_2, x_3, \ldots, x_k)$ in the support set and assigns $\hat{x}$ to the class of the most similar sample, as shown in Equation (2).
In Equation (1), $y$ is the class label and $k$ denotes the $k$th fault class.
In Equation (2), $P$ is the probability of similarity and $C$ is the fault class most similar to the test sample $\hat{x}$.
In the N-shot k-way test, there are k classes in the support set and each class has N different instances, as shown in Equation (3); the support set is given by Equation (4).
In Equations (3) and (4), $H$ is a set containing multiple instances of the same class, $k$ denotes the $k$th fault class, and $N$ denotes the $N$th instance in the same fault class.
The model judges the similarity between the test sample $\hat{x}$ and each of the $k \times N$ instances in the support set and assigns $\hat{x}$ to the class of the most similar instance, as shown in Equation (5).
Here, $P$ and $C$ have the same meaning as in Equation (2); the difference lies in the support set used, $S$ versus $S_k$.
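The one-shot and N-shot decision rules described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: `toy_similarity` is a hypothetical stand-in for the trained Siamese network, the two-dimensional samples are made up, and the function names are not taken from the paper.

```python
import numpy as np

def toy_similarity(a, b):
    """Hypothetical stand-in for the trained Siamese network.

    Maps the Euclidean distance between two samples to a score in (0, 1],
    where 1 means identical. The real model would output the learned
    probability of similarity P instead.
    """
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-np.linalg.norm(diff)))

def classify(test_sample, support_samples, support_labels, similarity=toy_similarity):
    """Assign test_sample the label of its most similar support instance."""
    scores = [similarity(test_sample, s) for s in support_samples]
    best = int(np.argmax(scores))  # index of the highest similarity
    return support_labels[best]

# One-shot 3-way: one (made-up) instance per class.
one_shot_x = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
one_shot_y = ["normal", "inner", "outer"]

# 2-shot 3-way: the support set is flattened to k*N instances.
n_shot_x = one_shot_x + [[0.5, -0.5], [5.5, 4.5], [9.0, 1.0]]
n_shot_y = one_shot_y + ["normal", "inner", "outer"]

print(classify([4.6, 5.2], one_shot_x, one_shot_y))
print(classify([9.2, 0.8], n_shot_x, n_shot_y))
```

In this sketch the N-shot case needs no separate rule: flattening the support set to k*N instances and taking the maximum similarity covers both strategies.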

Fine-Tuning-Based Method
The main goal of fault diagnosis based on transfer learning is to transfer the learned knowledge from the source domain to the target domain. Among the many current transfer learning strategies, the fine-tuning-based method has been widely studied and proved to be effective. We are inspired by the fine-tuning-based method and put forward our method strategy.
The learning process of the fine-tuning-based method is divided into two stages. First, the network model learns diagnostic knowledge in the source domain; then, the fully connected layer is fine-tuned in the target domain to obtain a new classifier, as shown in Figure 2.

The Proposed Few-Shot Transfer Learning Methods
Inspired by the fine-tuning-based method, we put forward our own strategy. From Section 2, we can see that the support set plays an important reference role in the Siamese network. The test sample $\hat{x}$ is always compared with the samples in the support set, and the most similar examples in the support set are selected for classification, as shown in Equations (2) and (5). In few-shot transfer learning based on a Siamese network, we assume that a small amount of target domain data has been obtained and use it to adjust the support set, as in Figure 3. The following two few-shot transfer learning methods are proposed.
(1) S(s+t): Directly add target domain samples to the support set.
In this method, after training with source domain data, a very small number of target domain samples $(x_t, y_t)$ are added to the original support set, and the model is then tested. In this case, the support set is given by Equation (6). In this paper, we uniformly use S(s+t) to denote the method of directly adding target domain samples to the support set.
Here, the $s$ in $x_s$ denotes the source domain and the $t$ in $x_t$ denotes the target domain.
(2) S(t): Replace the original sample in the support set with the target domain sample.
In this method, after training the model with source domain data, a very small number of target domain samples are used to replace the original samples in the support set, and the model is finally tested. At this point, the support set is shown in Equation (7). In this paper, we uniformly use S(t) to denote the replacement of the original sample in the support set with the target domain sample.
Here, the $t$ in $x_t$ denotes the target domain.

Model
Figure 4 shows the model we use. It is a Siamese network built from two deep convolutional neural networks with wide first-layer kernels (WDCNN). The two WDCNNs have the same structure and parameters, and their weights are shared. The WDCNN architecture is given in Table 1 and is consistent with the setting in reference [33]. This design is used for two reasons: the vibration signal is more sensitive to the overall correlation in the time domain or frequency domain, so useful information in the signal is lost if the first-layer kernel is too small; and a network in which all layers have small kernels may be affected by the high-frequency noise common in industrial environments, resulting in poor feature encoding. WDCNN with a wide first-layer kernel has been shown to have good anti-noise ability, generalization ability and robustness. The model consists of a series of convolution layers; the stride of the first layer is set to 16, and the stride of the other layers is fixed to 1. To optimize the performance of the model, the number of convolution filters is a multiple of 16. The convolution layers use the ReLU activation function to encode the features, and the fully connected layer uses the sigmoid activation function to map the features.
The input is a pair of samples $(x_1, x_2)$, which can be from the same or different classes. The output is the probability of similarity between the sample pair. First, the metric distance between the outputs of the two networks is optimized through Equation (8), where $f$ represents the deep convolutional network. Equation (9) then determines the probability of similarity, where sigm represents the sigmoid function and FC is a fully connected layer.
Let $y$ be a length-$M$ vector that contains the labels for the minibatch. We assume $y^{(i)} = 1$ when the two samples of the $i$th pair are from the same class, and $y^{(i)} = 0$ when they are from different classes. We impose a regularized cross-entropy objective on our binary classifier, given by Equation (10). The optimizer we chose is Adam, which calculates individual adaptive learning rates. The parameters are updated through Equation (11), where $w^{(T+1)}$ denotes the parameters after the update at epoch $T$, $L^{(T)}$ is the loss function, $\beta_i$ is the forgetting factor for the $i$th moment of the gradient, and $m$ and $v$ are moving averages.
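The similarity output, the regularized cross-entropy objective and the Adam update described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation. It assumes an element-wise absolute difference as the metric distance, an L2 penalty as the regularizer, and the commonly used Adam default hyperparameters; all function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def similarity_probability(f1, f2, fc_weights, fc_bias):
    """P = sigm(FC(|f(x1) - f(x2)|)) for a batch of encoded sample pairs."""
    distance = np.abs(f1 - f2)  # assumed metric: element-wise absolute difference
    return sigmoid(distance @ fc_weights + fc_bias)

def regularized_bce(p, y, weights, l2=1e-4, eps=1e-12):
    """Mean binary cross-entropy over the minibatch plus an assumed L2 penalty."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    ce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return ce + l2 * np.sum(weights ** 2)

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # moving average of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # moving average of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Tiny usage example with made-up encodings for a minibatch of two pairs.
f1 = np.array([[1.0, 2.0], [0.0, 1.0]])
f2 = np.array([[1.0, 2.0], [3.0, -1.0]])
fc_w = np.array([-0.5, -0.5])
p = similarity_probability(f1, f2, fc_w, fc_bias=1.0)
loss = regularized_bce(p, np.array([1.0, 0.0]), fc_w)
print(p, loss)
```

With negative weights in the final layer, a larger distance between the two encodings gives a lower similarity probability, which is the behavior a trained model is expected to learn.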

Data Introduction and Processing
As with most deep learning algorithms, validating our proposed transfer learning strategy requires appropriate data samples. We selected the data provided by Case Western Reserve University (CWRU) [34] as the ASF dataset, that is, the source domain. The data were collected on the CWRU experimental platform (shown in Figure 5), and all faults are single-point damage introduced by electro-discharge machining (EDM). The vibration acceleration signal of the faulty bearing is collected by an accelerometer at a sampling frequency of 12 kHz. The bearings selected in this paper are installed at the drive end, and there are three types: bearings with an inner ring fault, bearings with an outer ring fault, and normal bearings. The parameters are shown in Table 2.
On a modular test bench (Figure 6), the Paderborn University (PU) researchers collected 6 sets of normal bearing data, 12 sets of artificially damaged bearing data of three fault types, and 14 sets of naturally damaged bearing data produced by accelerated lifetime tests [35]. Damage levels are assigned according to the length of the damage as a percentage of the pitch circumference (Table 3). The vibration acceleration signal of the faulty bearing is collected by an accelerometer at a sampling frequency of 64 kHz. We chose the natural damage dataset of PU as the target domain data; the details of the parameters are shown in Table 4.
We sampled and processed the CWRU source domain data in Table 2, taking every 2048 data points as one sample. Because the original recordings do not contain enough data points, the number of non-overlapping samples that can be extracted is too small, and a very small number of training samples easily causes over-fitting. Therefore, we use overlapping sampling, as shown in Figure 7a: each sample partially overlaps the subsequent sample, with an offset of 80, and the training samples are obtained in this way.
Similarly, we process the PU natural damage fault data as shown in Figure 7b. This yields the testing samples and a small number of samples for adjusting the support set (SNSASS). It is worth noting that the testing samples and the SNSASS are independent and do not overlap. The SNSASS can be seen as the small number of samples that can be obtained from real machines. The experimental samples are shown in Table 5.
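The overlapping sampling described above can be sketched as a sliding window over the raw vibration signal. The sketch assumes that the offset of 80 is the shift between the starting points of consecutive 2048-point samples; the function name and the synthetic signal are illustrative only.

```python
import numpy as np

def overlapping_windows(signal, window=2048, offset=80):
    """Cut a 1-D signal into windows whose start points are `offset` apart."""
    signal = np.asarray(signal)
    if signal.ndim != 1:
        raise ValueError("signal must be one-dimensional")
    if len(signal) < window:
        # Too short to hold even one full window.
        return np.empty((0, window), dtype=signal.dtype)
    starts = np.arange(0, len(signal) - window + 1, offset)
    return np.stack([signal[s:s + window] for s in starts])

# Synthetic stand-in for a recording: just long enough for four windows.
recording = np.arange(2048 + 3 * 80)
samples = overlapping_windows(recording)
print(samples.shape)

# Without overlap (offset equal to the window length) far fewer samples result.
print(overlapping_windows(recording, offset=2048).shape)
```

Under this assumption, a recording of L points yields floor((L - 2048) / 80) + 1 samples, compared with floor(L / 2048) without overlap.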

S(s), S(s+t) and S(t) Analysis
To verify the validity of our proposed transfer method, we performed the following three experiments, as shown in Table 6.
(1) S(s): direct transfer method (baseline).
The direct transfer method is a simple method in which the network parameters are fixed after training, without any further optimization or adjustment. It uses source domain data for training and directly uses target domain data for testing. In this experiment, the support set of the direct transfer method based on a Siamese network is given by Equation (12), and its samples are all training samples from the source domain. We denote the direct transfer method based on the Siamese network by S(s).
In Equation (12), the $s$ in $x_s$ denotes the source domain.
In the S(s) experiment, we train the model on the ASF data from CWRU using the Adam optimizer for 90 epochs with a batch size of 64, after which the learned diagnostic knowledge is fixed. During testing, we directly input the NF samples provided by PU into the model for feature extraction, select from the support set (whose samples are training samples) the sample most similar to each test sample, and assign the test sample to the same class.
(2) S(s+t): directly add target domain samples to the support set.
The training process is the same as that of (1). Before testing, however, the SNSASS are added to the support set as a classification reference. The testing process is the same as that of (1), except that the support set contains both training samples and SNSASS.
(3) S(t): replace the original samples in the support set with target domain samples.
In the S(t) experiment, the training stage is the same as that of (1); afterwards, all the samples of the original support set are replaced by SNSASS, so that the support set consists only of SNSASS. During testing, the PU samples are input to the model to obtain the results.
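The three experimental settings differ only in how the support set is assembled after training, which can be sketched as below. The function name, the mode strings and the placeholder samples are illustrative assumptions; each sample is represented as a hypothetical (signal, label) pair.

```python
def build_support_set(source_samples, target_samples, mode):
    """Assemble the support set for one of the three settings.

    source_samples: (signal, label) pairs from the source-domain training set
    target_samples: (signal, label) pairs from the target domain (SNSASS)
    mode: "S(s)", "S(s+t)" or "S(t)"
    """
    if mode == "S(s)":      # baseline: source-domain samples only
        return list(source_samples)
    if mode == "S(s+t)":    # add the SNSASS to the original support set
        return list(source_samples) + list(target_samples)
    if mode == "S(t)":      # replace the original support set with the SNSASS
        return list(target_samples)
    raise ValueError(f"unknown mode: {mode!r}")

# Placeholder data: strings stand in for the vibration signals.
source = [("cwru_signal_%d" % i, "inner") for i in range(6)]
snsass = [("pu_signal_%d" % i, "inner") for i in range(2)]

for mode in ("S(s)", "S(s+t)", "S(t)"):
    print(mode, len(build_support_set(source, snsass, mode)))
```

In all three cases the trained network stays unchanged; only the reference samples that a test sample is compared with are different.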
First, we verify the performance of S(s) (baseline), S(s+t) and S(t) in the A→D, B→D, C→D, A→E, B→E and C→E transfer tasks. Each experiment is carried out 10 times and the results are averaged. The experimental results are shown in Figure 8. It can be seen from Figure 8 that S(t) has a clear advantage in all tasks. Its accuracy is more than 89.69%, which is much higher than that of the other two methods, and it is 42.18% higher than S(s) in C→D. This is because, in S(t), the instances of the support set are all SNSASS $(x_t, y_t)$ from the target domain, and the feature space distribution of the test samples $\hat{x}$ to be classified is very similar to that of $x_t$, so it is easy to find similar examples in the support set and assign the test sample to the same fault class. The experimental results of S(s) and S(s+t) are very close, but in most cases the accuracy of S(s+t) is slightly higher than that of S(s). This is because the support set of S(s+t) contains a small number of SNSASS. Based on the few-shot learning theory (see Section 2), these SNSASS can help the test sample $\hat{x}$ find its most similar class. However, they account for only a small proportion of the support set (see Equation (13), $\eta = 15/(1980+15) \approx 0.75\%$), so they cannot bring as large a performance improvement as in S(t) ($\eta = 15/15 = 100\%$).

$$\eta = \frac{n_{\mathrm{SNSASS}}}{n_{\mathrm{SNSASS}} + n_{\mathrm{Training}}} \quad (13)$$

where $\eta$ is the proportion of SNSASS in the total number of samples in the support set, $n_{\mathrm{SNSASS}}$ is the number of SNSASS, and $n_{\mathrm{Training}}$ is the number of training samples.
To further verify the effect of $\eta$ on S(s+t) and S(t), we gradually increase the number of SNSASS and repeat the experiment, with each experiment repeated 10 times; the result is shown in Figure 9. As can be seen from Figure 9a, as the number of SNSASS increases ($\eta$ increases), the accuracy of S(s+t) does not increase linearly, but it shows an overall increasing trend, especially in the A→D, B→D and C→D experiments.
However, as shown in Figure 9b, the performance of S(t) does not improve as the number of SNSASS increases, and its accuracy fluctuates within an allowable error range. In other words, even a small amount of target domain data is enough for S(t) to reach its full performance.
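The proportion in Equation (13) can be checked with a few lines of Python. The function name is illustrative, and the second argument is taken to be the number of training samples that remain in the support set (zero for S(t)).

```python
def snsass_proportion(n_snsass, n_training):
    """Proportion of SNSASS in the support set, as in Equation (13)."""
    total = n_snsass + n_training
    if total == 0:
        raise ValueError("the support set is empty")
    return n_snsass / total

# S(s+t): 15 SNSASS added to 1980 training samples.
print(f"S(s+t): {snsass_proportion(15, 1980):.2%}")
# S(t): the support set contains only the 15 SNSASS.
print(f"S(t):   {snsass_proportion(15, 0):.2%}")
```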

Comparisons with Other Methods
We also compare our method with some popular methods, including WDCNN [18,33], CNN_MMD [36], CNN_FT [37], DANN [38] and MRN [30]. Note that for each method we set the experimental parameters, including data format, hyperparameters and epochs, to the best case according to its characteristics. The number of training samples is 1980, the number of SNSASS is 15, and the number of test samples is 225. As before, each method is tested 10 times on each of the A→D, B→D, C→D, A→E, B→E and C→E transfer tasks, and the results are averaged. The experimental results are shown in Figure 10. S(t)_5-shot achieves the highest accuracy in all transfer learning tasks, with an average of 96.53%, and S(t)_1-shot ranks second with an average of 94.63%, followed by MRN, CNN_FT, DANN, S(s+t), S(s) and CNN_MMD. WDCNN performs the worst in all transfer learning tasks, with an average accuracy of only 44.86%. We recognize that it is unfair to compare WDCNN with these more advanced methods, but its result reflects the difficulty of these transfer learning tasks: there is a big gap between the fault features of ASF and those of NF. It is also evident from the figure that for almost all methods (except WDCNN) the results of A→D, B→D and C→D are worse than those of A→E, B→E and C→E. The reason is that D has a lower damage level (level 1) than E (level 2); because the damage in E is more serious, its fault features are more obvious, so the knowledge learned from A, B and C diagnoses E better.
Figure 11 shows the standard deviation over the 10 repeated experiments for each method in the A→D, B→D, C→D, A→E, B→E and C→E transfer learning tasks. As can be seen from Figure 11, S(t)_5-shot has the smallest standard deviation in all transfer learning tasks, with an average of 2.66%, followed by S(t)_1-shot with an average of 3.53%; both are much smaller than those of the other methods.
Apart from MRN, whose average standard deviation is 8.45%, the remaining methods all have average standard deviations above 10%. This indicates that it is difficult to learn diagnostic knowledge from ASF to diagnose NF, which makes the diagnosis results of the other methods very unstable. At the same time, it demonstrates that S(t) is much more stable than the other methods.
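The evaluation protocol above, in which each method is run 10 times per task and summarized by its mean accuracy and standard deviation, can be sketched as follows. The accuracy values are made-up placeholders rather than results from the paper, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, standard deviation) of the accuracies from repeated runs."""
    mean = statistics.fmean(accuracies)
    # Sample standard deviation (n - 1 in the denominator); assumed, not confirmed.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std

# Hypothetical accuracies (in %) from 10 repetitions of one transfer task.
placeholder_runs = [95.1, 96.4, 94.8, 97.0, 95.9, 96.2, 95.5, 96.8, 94.9, 96.0]
mean_acc, std_acc = summarize_runs(placeholder_runs)
print(f"mean = {mean_acc:.2f}%, std = {std_acc:.2f}%")
```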

The Influence of Different Source Domain and the Number of Training Samples
Both the quality and the quantity of the source domain data are important. Selecting appropriate bearing fault source domain data, in an appropriate quantity, to train the model matters a great deal, but in actual industrial production it is difficult to determine which source domain data are appropriate. Therefore, in this section we discuss how demanding the proposed method is with respect to the source domain data. We selected several relatively well-performing methods for comparison. Figure 12 shows the result curves of DANN, MRN, CNN_FT and S(t)_1-shot when fault diagnosis knowledge is learned from datasets A, B and C, respectively, and used to diagnose D and E. It can be clearly seen that when DANN, MRN and CNN_FT learn knowledge from different source domains and the model is then fixed, the experimental results differ greatly. The reason is that there is a big gap in speed between the working conditions of A (1772 rpm), B (1750 rpm) and C (1730 rpm). In contrast, the results of S(t)_1-shot when learning from A, B and C to diagnose D and E change only slightly, with a slight downward trend, indicating that S(t) learns well and can make good use of the knowledge of the source domain. The slight downward trend from A to B to C can be explained as follows: judging by the speeds of A, B and C, we consider the working condition of A to be more complex than that of B, and that of B to be more complex than that of C. The model can learn more obvious fault features under more complex working conditions and thus better complete the transfer task. The authors of [18,21,39] reached a similar conclusion: more complex working conditions carry more diagnostic knowledge. Next, we explore the influence of different fault diameters of the source domain bearing, for which an additional small experiment was performed.
Following the principle of controlling variables, CWRU data (with a load of 3 hp) with fault diameters of 0.021, 0.014 and 0.007 inches were used as training sets and tested on D and E; the results are shown in Figure 13. As can be seen in Figure 13, S(t) effectively learns diagnostic knowledge from different fault diameters and shows the best performance, followed by MRN. DANN performs worst; surprisingly, its accuracy is just over 20% at 0.014 inches. However, we did not find a rule describing how the fault diameter affects the performance of the model, which may be due to the big difference between ASF and NF.
To explore the performance of the various methods under different sample numbers, experiments were carried out with 90, 300, 600, 1200, 1500 and 1980 training samples. Figure 14 shows the curves of all experimental results against the number of training samples. Surprisingly, the experimental results do not simply improve as the number of training samples increases, but show a distinctive curve shape. This is because too few training samples cause the model to learn insufficient knowledge that can be used in the target domain, resulting in poor diagnostic performance there, whereas too many training samples cause the learned knowledge to be too focused on the source domain, so that it transfers poorly to the target domain. Compared with the other methods, however, the results of S(t) do not change greatly with the number of training samples, which shows that S(t) depends very little on the number of samples and is stable. This is because the few-shot learning strategy of S(t) can learn and use knowledge from a small number of training samples and is not sensitive to the growth of data.
Given a small sample of the target domain, S(t), like fine-tuning-based methods, can improve the performance of the model in transfer learning. However, after obtaining the new target domain data, fine-tuning-based methods still need to retrain the models that were trained in the source domain. The S(t) method does not need a second training stage; it only needs these few target domain samples to be placed in the support set, which is more convenient than fine-tuning-based methods in some practical applications.

Conclusions
This paper shows that there is still a considerable gap between research on intelligent fault diagnosis and practical industrial application. A bearing fault diagnosis method based on few-shot transfer learning across different datasets is proposed, which uses a very small number of target domain samples to adjust the support set and thereby improve the generalization performance of the model. Many groups of transfer experiments are carried out using the ASF dataset of CWRU and the NF dataset of PU. The conclusions are as follows:
(1) With only a small amount of SNSASS, the S(t) method greatly improves the accuracy of fault classification. The accuracy of S(s+t) is not significantly improved, but it increases as the number of SNSASS increases.
(2) Compared with other methods, the proposed S(t) method has the highest accuracy in all cases and is also the most stable method.
(3) S(t) can learn diagnostic knowledge from different source domains and sample numbers, and can effectively use this knowledge to identify the health state of the target bearing, showing strong generalization and robustness. In addition, unlike the fine-tuning-based method, S(t) does not need secondary training, which is more convenient in some practical applications.
S(t) provides a feasible way to apply laboratory data knowledge to real machine fault diagnosis, solves the difficulty that a large amount of data cannot be collected in the real world, and also provides a new idea and method for transfer learning. However, obtaining a small amount of target domain data (SNSASS) is the key to the S(t) method. In some cases of actual industrial production, it is also not easy to obtain a small amount of target domain data, which is a limitation of the S(t) method. At the same time, although the difference between ASF and NF brings great challenges to the transfer learning tasks, because of the lack of available data, we were only able to perform three classification tasks. More classification experiments and verification in more datasets can be performed in the future.