Few-Shot Rolling Bearing Fault Diagnosis with Metric-Based Meta Learning

Fault diagnosis methods based on deep learning and big data have achieved good results on rotating machinery. However, conventional deep learning methods for bearing fault diagnosis are mostly based on artificially simulated laboratory data, which deviate from actual fault data and therefore reduce the generalization performance of these methods. In addition, labeled data are very precious in real industrial environments: because of expensive equipment and personnel safety issues, it is difficult to obtain a large amount of high-quality labeled fault data. Therefore, in this paper, we propose a metric-based meta-learning method named Reinforce Relation Network (RRN) for diagnosing bearing faults with few-shot samples. In the proposed method, a 1D convolutional neural network is used to extract fault features, and a metric learner is used to predict the similarity between samples under different transfer conditions. Label smoothing and the Adabound algorithm are utilized to further improve the classification performance of the network. The performance of the proposed method is verified on a dataset that contains both artificial damage and natural damage data. Comparison studies with other methods demonstrate the superiority of the proposed method in the few-shot scenario.


Introduction
Bearings, as part of rotating machinery, play a very important role in motors and engines. Bearing failure will cause mechanical damage and can even threaten the safety of users. Therefore, accurate prediction and diagnosis of various bearing failures in real industrial scenarios is of great significance. In the past few years, a large number of traditional signal processing and machine learning methods have been applied to bearing fault detection, including wavelet transform (WT), Fourier transform, empirical mode decomposition (EMD) [1,2], principal component analysis (PCA) [3], SVM [4], k-nearest neighbor [5], and random forest [6]. Ren [7] proposed a 3-D waterfall spectrum combined with a reassigned wavelet scalogram method to handle non-linear and non-stationary vibration signals, while Yan [8] proposed a novel multiscale morphology analysis method, which preserves signal details and performs well in detecting bearing defects. However, the conventional fault diagnosis method has some disadvantages: (1) methods based on signal processing and conventional machine learning require a large number of manual feature extraction operations, which cannot adapt well to the complex dynamic system of bearing vibration signals; [...]
[...] few-shot meta learning, only a few samples are given from the training set, so problems such as the few-shot scenario and condition transfer can be effectively solved by few-shot meta learning technology. For the above reasons, this paper proposes a metric-based few-shot meta-learning technology, applies it to bearing fault diagnosis, and verifies it on artificial-natural bearing datasets. The results show that the proposed method is better than conventional methods in few-shot scenarios. The contributions of this article are summarized as follows.
1. A metric-based few-shot meta learning method is proposed for bearing fault diagnosis;
2. Label smoothing is adopted to alleviate over-fitting and improve generalization in few-shot learning;
3. Adabound is introduced to fault diagnosis for the first time, which can converge faster and obtain higher accuracy.
The rest of this article is organized as follows. Section 2 introduces few-shot learning and meta learning. Section 3 elaborates on the proposed method. Section 4 introduces label smoothing and Adabound. In Section 5, the proposed method is evaluated on the artificial-natural bearing datasets. Finally, Section 6 concludes this article.

Few-Shot Learning
Machine learning has been highly successful in data-intensive applications, but is often hampered when the dataset is small [21]. Humans, however, can learn to recognize a new object or master a new concept from only one or a few instances; the essential reason is that humans can use prior knowledge to learn. Inspired by this, few-shot learning was proposed: it uses prior knowledge to learn data features in few-shot scenarios, in order to solve prediction and classification problems in the case of missing data [22]. Few-shot learning is a type of machine learning problem in which the experience contains only a limited number of examples with supervised information for the target task. Based on current research progress, few-shot learning methods can be divided into three categories: data, model and algorithm [23]. As shown in Figure 2a, a data-based method uses prior knowledge to augment the raw data from h1 to h2; with sufficient data, conventional deep learning methods can then be used. For the model-based method in Figure 2b, prior knowledge is used to constrain the complexity of H, where H is the hypothesis space determined by the model and h4 is the optimal hypothesis from data to label. In Figure 2c, an algorithm-based method is an optimization strategy that uses prior knowledge to search through H in order to find the best hypothesis h3 in H.
Data augmentation technology has been used extensively in tasks such as computer vision and natural language processing. In the field of bearing fault diagnosis, Zhang [24] performed data augmentation by manually copying and intercepting the original signal, Li [25] used Generative Adversarial Networks (GAN) to solve the problem of category imbalance, Gao [26] used a combination of the finite element method (FEM) and GAN, not only to supplement the number of missing labeled data, but also to supplement the missing attributes, and Cubuk [27] described a simple procedure called AutoAugment, which automatically learns the augmentation policy for deep network training. The core idea of the above methods is to use the existing labeled data, that is, prior knowledge, to create similar labeled data or to copy data directly according to the extracted features, so that the neural network can be trained on a large amount of labeled data and obtain good performance. However, the augmentation rules can be specific to the dataset, making them hard to apply to other datasets. Therefore, manual data augmentation cannot solve the few-shot learning (FSL) problem completely [28]. Moreover, the GAN-based data augmentation method has disadvantages, as the training and generation results of GAN are not robust enough.
Sensors 2020, 20, 6437
The most common model-based few-shot learning technology is embedding learning [29]: the training set and testing set are denoted $D_{train}$ and $D_{test}$, the embedding function $f$ projects the training samples $x_{train} \in D_{train}$ to a low-dimensional space $Z$, the embedding function $g$ projects the testing samples $x_{test} \in D_{test}$ to $Z$, and a similarity function $S$ is then used to predict the embedding similarity between classes. Zhang [30] used a Siamese network for bearing fault diagnosis. Vinyals [11] proposed a matching network as a semi-supervised method that assigns unlabeled samples to augment $D_{train}$ via soft-assignment during learning. Sung [31] used a relation network to embed the samples into a feature space simultaneously, and used a convolutional neural network to automatically find similarities between different image categories. Snell [32] proposed a prototypical network: instead of comparing $f(x_{test})$ with each $f(x_i)$, where $x_i \in D_{train}$, the prototypical network only compares $f(x_{test})$ with the class prototypes in $D_{train}$. For class $n$, the prototype $c_n$ is calculated by the formula
$$c_n = \frac{1}{K} \sum_{i=1}^{K} f(x_i)$$
where $x_i$ is one of the $K$ examples of the $n$th class in $D_{train}$, and $f$ is the embedding function.
The algorithm-based method uses strategies to find better initialization parameters more quickly or to refine existing parameters. Hinton [33] proposed a fine-tuning based method; this strategy takes the parameters $\theta_0$ of a pre-trained model learned from related tasks as a good initialization, and adapts them to $\theta$ using $D_{train}$. The assumption is that $\theta_0$ captures some general structures of the large-scale data, so it can be adapted to $D_{test}$ with a few iterations. Zhang [34] proposed a few-shot learning approach named model-agnostic meta-learning (MAML), which improves the efficiency of the model. Nichol [35] proposed a new algorithm called Reptile, which can obtain a better initialization parameter than MAML and pre-training.
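As a small worked example of the prototype formula, the sketch below averages the embeddings of the K support examples of each class. The 2-D vectors are made-up values standing in for the output of the embedding function f, and the helper name `class_prototypes` and the class names are hypothetical, not taken from [32].

```python
from collections import defaultdict


def class_prototypes(embeddings, labels):
    """Return {class label: mean embedding} for a support set."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    prototypes = {}
    for label, vecs in grouped.items():
        dim = len(vecs[0])
        # c_n = (1/K) * sum_i f(x_i), computed per embedding dimension
        prototypes[label] = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return prototypes


# Toy support set: 2 classes, K = 2 examples each (illustrative values only).
support_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 6.0]]
support_labels = ["inner_race", "inner_race", "outer_race", "outer_race"]

prototypes = class_prototypes(support_embeddings, support_labels)
print(prototypes)
```

With these toy values, each prototype is simply the midpoint of the two support embeddings of its class.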
All the above few-shot learning methods have been widely used in the fields of CV and NLP. However, few-shot learning is still very scarce in the field of rotating machinery fault diagnosis. In order to further develop few-shot fault diagnosis of bearings, it is necessary to explore the applicability of few-shot learning methods to bearing fault diagnosis, as well as their advantages and disadvantages compared with conventional methods.

Few-Shot Meta Learning
Meta learning is one of the most promising and trending research areas in the field of artificial intelligence. It produces a versatile AI model that can learn to perform various tasks without having to be trained from scratch on each of them. The method is widely used in the field of few-shot learning, where it is also known as few-shot meta learning. Meta learning can be divided into three categories: metric-based meta learning, initialization-based meta learning and optimization-based meta learning [36].
Metric-based meta learning learns the similarity between different classes. It uses a neural network to extract features from a dataset and measures similarity by computing the distance between the features of these classes; examples include Siamese networks, prototypical networks and relation networks. Initialization-based meta learning initializes the weights with optimal or near-optimal values so that convergence is reached faster; examples include MAML, Meta-SGD [37] and Reptile. Optimization-based meta learning uses two networks: a base network that actually tries to learn and a meta network that optimizes the base network. Mainstream methods of optimization-based meta learning include the LSTM Meta-Learner [38].

Data Preprocessing
During raw data preprocessing, the time shift and the frequency resolution of the vibration signal need to be considered. Since the neural network is not invariant to time shifts of the vibration signal, the original vibration signal is converted to the frequency domain through the Fast Fourier Transform (FFT) to solve this problem. For the frequency resolution (FR), a signal segment of sufficient length must be intercepted to ensure the FR, but the model capacity and computational overhead must be considered at the same time; as a trade-off, the input length of the models is uniformly set to 1024. For the random noise carried by the data, methods such as the Gaussian filter [39] are used in the preprocessing module, while methods such as enlarging the first-layer convolution kernel [24] and the threshold denoising algorithm [40] are used in the model module. This article mainly uses the approach of [24] for anti-noise processing.
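The two preprocessing considerations can be illustrated with a minimal sketch. It assumes a segment of 1024 time samples and the 64 kHz sampling frequency reported in the case study, and it uses a synthetic 1000 Hz tone instead of real bearing data; the function names are hypothetical and the radix-2 FFT is written out only to keep the example free of dependencies. The circular shift is an idealization of a time shift.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitude_spectrum(segment):
    """One-sided magnitude spectrum, scaled by the segment length."""
    n = len(segment)
    return [abs(c) / n for c in fft(segment)[: n // 2]]


FS = 64_000  # sampling frequency in Hz (from the case study)
N = 1024     # assumed segment length in samples

freq_resolution = FS / N  # spacing between FFT bins in Hz

# Synthetic tone (assumption): 1000 Hz falls exactly on an FFT bin.
signal = [math.sin(2 * math.pi * 1000 * t / FS) for t in range(N)]
shifted = signal[100:] + signal[:100]  # circular time shift by 100 samples

spec = magnitude_spectrum(signal)
spec_shifted = magnitude_spectrum(shifted)
peak_bin = max(range(len(spec)), key=spec.__getitem__)
max_diff = max(abs(a - b) for a, b in zip(spec, spec_shifted))

print(freq_resolution, peak_bin * freq_resolution, max_diff)
```

The magnitude spectra of the original and the shifted segment agree up to floating-point error, which is the property that motivates feeding the network frequency-domain data.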

Network Structure
Since transfer learning with pre-training and fine-tuning is very similar to few-shot learning, this paper compares the two methods; the same 1D convolutional neural network structure, pooling layers, batch normalization and activation function are used in both networks. In the first layer, a large convolution kernel is adopted to capture more shallow features while reducing the influence of high-frequency noise [41]. Both networks use the same feature extractor; in transfer learning, a fully connected layer with Softmax is used as the classifier, whereas in the few-shot learning method a convolutional neural network is used as the metric learner.

Methods
The transfer learning method based on pre-training and fine-tuning first trains the network on a large number of labeled data, and then fine-tunes the network on the specific categories; during fine-tuning, the weights of the feature extractor are fixed and only the classifier is trained. In few-shot meta learning, the data are divided into a meta training stage, which corresponds to the source domain, and a meta testing stage, which corresponds to the target domain; each stage is further divided into a support set and a query set. The support set in the two domains is used to calculate the prototype of data features, and the query set in the two domains is used to train and improve the model performance. As shown in Figure 3, in a common N-way K-shot problem, N categories are selected for the support set of the meta testing stage and K samples are selected for each category, without limiting the capacity of the other set.
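A minimal sketch of how one N-way K-shot episode could be drawn is given below. The pool of string identifiers is a placeholder for real signal segments, and `sample_episode`, the fixed query size and the class names are assumptions made for illustration rather than details from the paper.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode as (support, query) lists of (sample, label)."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never overlap.
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy pool: 8 classes with 20 sample ids each (placeholders for signal segments).
pool = {f"class_{c}": [f"c{c}_s{i}" for i in range(20)] for c in range(8)}

rng = random.Random(0)
support, query = sample_episode(pool, n_way=5, k_shot=3, n_query=4, rng=rng)
print(len(support), len(query))
```

Here the support set holds N x K = 15 labeled samples, while the query set size is a free choice, in line with the statement that the capacity of the other set is not limited.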

Transfer Learning
Pre-training refers to training on the source domain to obtain a feature extractor, and fine-tuning refers to fixing the feature extractor on the target domain and training the classifier. The feature encoder is learned in the source domain with the Adam optimizer at a learning rate of 0.001 for 100 epochs. A new classifier is then trained in the target domain using momentum-accelerated stochastic gradient descent with a learning rate of 0.01; each N-way K-shot task is processed as a minibatch and trained for 50 epochs, and the average accuracy of the last ten epochs is taken. The training is repeated 20 times to offset the randomness of sampling.

Few-Shot Meta-Learning
The few-shot meta-learning in this paper adopts a metric-based meta-learning method, the Reinforce Relation Network (RRN). Among metric-based meta-learning models, the relation network has achieved state-of-the-art results on few-shot problems. The feature extractor of the RRN is the same as in the transfer learning method, and the metric learner consists of two convolutional layers and two fully connected layers. Samples of the support set and query set are processed by the feature extractor and embedded into the same feature space; the metric learner then computes the similarity of the two sets and outputs the predicted similarity value.
As shown in Figure 4, meta learning usually uses a strategy called episodic training [42]. In the training procedure, features of the support set and query set are extracted by the CNN module and passed to the classifier, where the objective is to learn the projection function $r_\phi(f_\theta(x))$ between data and labels, which can be formalized as follows
$$\theta^*, \phi^* = \arg\min_{\theta, \phi} \mathcal{L}\big(r_\phi(f_\theta(x)),\, y\big)$$
where $f_\theta$ is the feature extractor, $r_\phi$ is the metric learner, $Z_s$ and $Z_q$ are the embedding vectors of the support set and query set produced by $f_\theta$, $\mathcal{L}$ is the training loss and $y$ denotes the labels. The average over 100 episodes is taken as the final accuracy, and the training is performed 20 times to offset the randomness of the sampling.
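The way a metric learner turns support and query embeddings into a class decision can be sketched as follows. This is only an illustration under stated assumptions: `stand_in_relation` is a fixed, hand-written similarity that replaces the learned convolutional metric learner, the support embeddings of a class are combined by averaging (an assumption), and the 2-D embeddings and class names are invented.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def stand_in_relation(z_s, z_q):
    """Placeholder for the learned metric learner: maps a pair to a score in (0, 1]."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(z_s, z_q))
    return 1.0 / (1.0 + dist_sq)


def classify_query(z_q, support_by_class, relation=stand_in_relation):
    """Score the query against every class and return (best label, all scores)."""
    scores = {
        label: relation(mean_vector(z_list), z_q)
        for label, z_list in support_by_class.items()
    }
    return max(scores, key=scores.get), scores


# Invented support embeddings Z_s for a 2-way 2-shot episode.
support_by_class = {
    "healthy": [[0.0, 0.0], [0.2, 0.0]],
    "outer_race": [[2.0, 2.0], [2.0, 1.8]],
}

label, scores = classify_query([1.9, 2.1], support_by_class)
print(label, scores)
```

In the trained model the relation function is learned jointly with the feature extractor, so the scores reflect a learned rather than a fixed notion of similarity.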

Label Smoothing
Because of the overfitting problem in few-shot learning, the generalization ability of the model is reduced. In addition, the original data may be mixed with abnormal signals, and one-hot encoding leads the model to learn these abnormal characteristics, which further reduces its generalization ability. Therefore, this article uses label smoothing to alleviate the above problems [43]. The core of label smoothing is to modify the ground-truth label as
$$y_{ls} = (1 - \alpha)\, y_{hot} + \frac{\alpha}{K}$$
where $K$ is the number of label classes, and $\alpha$ is a hyperparameter that determines the amount of label smoothing. If $\alpha = 0$, we obtain the original one-hot encoded $y_{hot}$. If $\alpha = 1$, we obtain the uniform distribution. The purpose of label smoothing is to soften the one-hot encoding. A one-hot encoding contains only 0 and 1, which causes the model to trust the raw data too much; if the raw data contain abnormal samples, this seriously degrades the model and reduces its robustness. As can be seen from the above equation, when $y_{hot}$ equals 0, $y_{ls}$ equals $\alpha/K$, which is larger than 0. When $y_{hot}$ equals 1, $y_{ls}$ equals $1 - \alpha + \alpha/K$, which is smaller than 1. By learning the softened values rather than the hard ground-truth labels, the model can alleviate the problem of over-fitting.
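A short numerical illustration of the smoothing rule follows. The choice of K = 5 classes and α = 0.1 is an assumption made for the example, not a value reported in the paper, and `smooth_labels` is a hypothetical helper.

```python
def smooth_labels(one_hot, alpha):
    """Apply y_ls = (1 - alpha) * y_hot + alpha / K to a one-hot vector."""
    k = len(one_hot)
    return [(1.0 - alpha) * y + alpha / k for y in one_hot]


y_hot = [0.0, 0.0, 1.0, 0.0, 0.0]  # K = 5 classes, true class at index 2
y_ls = smooth_labels(y_hot, alpha=0.1)

unchanged = smooth_labels(y_hot, alpha=0.0)  # alpha = 0 keeps the one-hot label
uniform = smooth_labels(y_hot, alpha=1.0)    # alpha = 1 gives the uniform distribution

print(y_ls, unchanged, uniform)
```

The smoothed vector still sums to 1, so it remains a valid target distribution for a cross-entropy style loss.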

Adabound
Adaptive optimizers such as Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates, in contrast to plain stochastic gradient descent (SGD). However, their unstable and extreme learning rates can cause performance degradation. By dynamically bounding the learning rate, Adabound [44] can alleviate this problem. The core of the Adabound algorithm is to limit the adaptive learning rate
$$\mathrm{Clip}\left(\alpha / \sqrt{V_t},\ \eta_l,\ \eta_u\right)$$
where $\alpha$ is the initial step size, $\alpha/\sqrt{V_t}$ is the learning rate of the algorithm, and $\eta_l$ and $\eta_u$ are the lower and upper bounds of the output. Adabound employs dynamic bounds on learning rates in these adaptive methods, where the lower and upper bounds are initialized as zero and infinity, respectively, and both smoothly converge to a constant final step size. This helps the model to converge faster and obtain higher accuracy.
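The clipping step can be sketched for a single scalar parameter. The bound schedule used here, with a final step size of 0.1 and a convergence factor `gamma` of 1e-3, is one simple choice that starts near zero and at a very large value and converges to the final step size; it is an assumption for illustration, as are the function names and the small `eps` term, and it is not presented as the exact configuration used in the paper.

```python
import math


def bounds(step, final_lr=0.1, gamma=1e-3):
    """Lower and upper learning-rate bounds at a given step (step >= 1)."""
    lower = final_lr * (1.0 - 1.0 / (gamma * step + 1.0))
    upper = final_lr * (1.0 + 1.0 / (gamma * step))
    return lower, upper


def clipped_rate(alpha, v_t, step, final_lr=0.1, gamma=1e-3, eps=1e-8):
    """Clip(alpha / sqrt(V_t), eta_l, eta_u) for one parameter."""
    raw = alpha / (math.sqrt(v_t) + eps)
    lower, upper = bounds(step, final_lr, gamma)
    return min(max(raw, lower), upper)


early = clipped_rate(alpha=0.001, v_t=1.0, step=1)             # bounds are loose
late_large = clipped_rate(alpha=0.001, v_t=1e-12, step=10**7)  # extreme rate is capped
late_small = clipped_rate(alpha=0.001, v_t=1e6, step=10**7)    # tiny rate is raised

print(early, late_large, late_small)
```

Early in training the bounds are loose and the adaptive rate passes through almost unchanged; late in training both extreme cases are pulled towards the final step size, which is the behaviour described above.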

Case Study
The Paderborn University (PU) bearing datasets were provided by Lessmeier, C. et al. [45] for condition monitoring and bearing fault diagnosis based on vibration and motor current signals. A total of 32 bearings of type 6203 were included: six undamaged (healthy) bearings, 12 artificially damaged bearings, and 14 bearings with real damage caused by accelerated lifetime tests. Each of the 32 fault categories was tested under four different working conditions with different combinations of rotational speed, load torque, and radial force. All bearings were installed in the modular test rig for a unified test; the modular setup used to collect the PU bearing dataset is shown in Figure 5.
Since most current deep-learning-based fault diagnosis methods rely on data simulated in the laboratory, there is still a gap with the real situation. To address this problem, this paper trains on simulated data and tests on real data, which not only alleviates the problem of the small amount of real data, but also improves the accuracy of model transfer to real situations. The PU bearing dataset contains both artificially simulated and real bearing fault signals, which makes this experiment possible. The visualization of the signals in the time domain and frequency domain is shown in Figure 6. Figure 6a,b shows healthy bearings, Figure 6c,d shows bearings from the artificial damage set, and Figure 6e,f shows bearings from the natural damage set.

To perform artificial-to-natural few-shot meta learning, we selected 13 representative classes from the total of 32: 1 healthy bearing, 8 artificially damaged bearings, and 4 bearings with real failures resulting from accelerated lifetime testing. The test rig ran at n = 900 rpm with a load torque of M = 0.7 Nm and a radial force on the bearing of F = 1000 N; the sampling frequency was 64 kHz and the vibration signal was measured perpendicular to the axis. The details of these 13 bearings are shown in Table 1. We conducted few-shot meta learning training on the eight artificial damage categories, and tested on the four real damage categories and the one healthy state.
Each category was tested with 1, 3, 5 and 10 shots, and the accuracies are shown in Figure 7. Although Zhang et al. [30] proposed a few-shot learning strategy for bearing fault diagnosis, its training and testing sets are of the same classes, so in essence it is still a variant of transfer learning based on prior knowledge and fine-tuning.
It can be seen that the few-shot meta learning method achieved the best accuracy in the 1-shot and 3-shot situations. In the 5-shot situation, the transfer learning method achieved the best accuracy, while in the 10-shot situation the performance of the two methods is almost the same. The worst accuracy is also obtained by transfer learning, in the 1-shot situation. These results indicate that few-shot meta learning can effectively learn sample features when samples are extremely scarce, and that the knowledge obtained from artificially simulated fault data can be effectively transferred to natural fault data to improve the classification accuracy.
Moreover, a statistical method, the support vector machine (SVM), was used for comparison. For each training set, the SVM used the whole training set to fit the model and achieved the worst results in all scenarios; this may be due to the small amount of data, meaning that the SVM could not learn the distribution of the samples and was unable to find the best hyperplane with the largest geometric margin.
Furthermore, Figure 8 shows the confusion matrices (a), (b), (c) and (d), corresponding to the 1-shot, 3-shot, 5-shot and 10-shot tasks, where the prediction results and the ground truths are presented. Each episode was repeated 1000 times to offset the randomness of sampling, and the average accuracy of the 1000 runs was taken.
Figure 8. The confusion matrices in different tasks using the proposed method. (a-d), corresponding to 1-shot, 3-shot, 5-shot and 10-shot tasks.
Figure 9a shows the differences between the RRN and transfer learning (TL) methods under label smoothing (LS) regularization and Adabound (Ada) optimizer improvements. The accuracy predicted by the proposed RRN is better than that of transfer learning in the 1-shot and 3-shot situations. The proposed method with label smoothing obtains a mean accuracy of 97.54%, compared with 97.38% for the proposed method without label smoothing and 96.70% for the transfer learning method, which indicates that label smoothing can effectively improve the prediction accuracy.
Moreover, the transfer learning method and RRN with the Adabound optimizer are investigated, and the accuracies of these methods are presented in Figure 9b. Adabound is effective only for RRN; when added to the transfer learning method, the accuracy is even lower than the baseline. The mean accuracies of the transfer learning method with Adabound, RRN with Adabound, and RRN with label smoothing and Adabound are 96.36%, 97.59% and 97.47%, respectively.
Adabound can not only improve the accuracy of the model, but can also improve the convergence speed of the model. The average number of episodes used to stabilize the accuracy above 96% by using the Adam optimizer and Adabound optimizer is shown in Figure 10. To offset the deviation caused by random sampling, the above results are the average results after 1000 runs for each episode. In the 1-shot and 3-shot scenarios, the average episode required by the Adabound optimizer is 61 and 28, while for Adam it is 81 and 35, so the convergence speed of Adabound is faster. In the case of 5-shots or more, the convergence speed of the Adam and Adabound optimizer is gradually consistent. The metric-based, few-shot meta-learning was implemented by the Pytorch framework under Python 3.5. Training and network testing were performed on a personal computer Moreover, the transfer learning method and RRN with Adabound optimizer is investigated, and the accuracies of the above methods are presented in Figure 9b. It is seen that Adabound is effective only for RRN and is inferior even to the baseline accuracy when added to the transfer learning method. The mean accuracies of the transfer learning method with Adabound, RNN with Adabound, RNN with label smoothing and Adabound are 96.36%, 97.59% and 97.47%, respectively.
Adabound not only improves the accuracy of the model but also speeds up its convergence. Figure 10 shows the average number of episodes needed for the accuracy to stabilize above 96% with the Adam and Adabound optimizers. To offset the deviation caused by random sampling, the reported results are averages over 1000 runs for each episode.
In the 1-shot and 3-shot scenarios, the Adabound optimizer requires 61 and 28 episodes on average, while Adam requires 81 and 35, so Adabound converges faster. With 5 shots or more, the convergence speeds of the Adam and Adabound optimizers gradually become consistent. The metric-based few-shot meta-learning method was implemented in the PyTorch framework under Python 3.5. Training and testing were performed on a personal computer with the Windows 10 operating system, an Intel Core i7-9770F CPU, and a GTX 1660Ti GPU. Each episode requires 10.4 s of training time on average.
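The convergence comparison above can be sketched as follows. The helper `episodes_to_stabilize` is hypothetical and encodes one possible reading of "stabilize above 96%" (the first episode after which the accuracy never drops back to or below the threshold); the paper does not state its exact criterion. The episode counts are the ones reported above, and the accuracy curve is made up for illustration.

```python
def episodes_to_stabilize(accuracies, threshold=0.96):
    """Return the 1-based episode from which accuracy stays above threshold.

    Returns None if the curve never settles above the threshold.
    This stabilization criterion is an assumption, not the paper's definition.
    """
    first = None
    for episode, acc in enumerate(accuracies, start=1):
        if acc > threshold:
            if first is None:
                first = episode
        else:
            first = None  # dropped back, so restart the search
    return first


# Made-up accuracy curve: crosses 96% at episode 2, dips, settles at episode 4.
curve = [0.90, 0.97, 0.95, 0.97, 0.98]
print("stabilizes at episode:", episodes_to_stabilize(curve))

# Average episode counts reported in the text, keyed by number of shots.
adam = {1: 81, 3: 35}
adabound = {1: 61, 3: 28}

saved = {shots: adam[shots] - adabound[shots] for shots in adam}
relative = {shots: saved[shots] / adam[shots] for shots in adam}
for shots in sorted(saved):
    print(f"{shots}-shot: {saved[shots]} fewer episodes "
          f"({relative[shots]:.1%} of the Adam count)")
```

The 1-shot difference of 20 episodes is the largest saving among the reported settings.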
Figure 10. The episodes required to achieve 96% accuracy.
Combining the above analyses, RRN with Adabound performs best in the extremely few-shot situation, while the conventional transfer learning method performs best as the number of shots increases. Furthermore, in this paper, eight fault categories of artificial damage are used for training, and five categories of natural damage are used for testing and verification. Relative to the eight categories in the training set, natural damage is a new fault mode; therefore, the method can recognize new fault classes. However, because sampling is random and the robustness of the neural network cannot be proved, the above results suggest that the robustness of the proposed method can still be improved. In addition, the interpretation of the neural network as a black box needs further study in the future.

Conclusions
In this paper, a few-shot meta-learning method for bearing fault diagnosis is studied. The raw signal is converted into a frequency-domain signal by FFT, so the time shift in the vibration signal need not be considered. For the transfer from artificial simulation to natural damage, a feature extractor and a metric learner are used to extract features and to compute the similarity between the learned features, respectively, so that features are extracted automatically and classification is enabled. Metric-based meta-learning methods, regularization methods, and the Adabound algorithm have been applied and tested on real data. For the sake of fairness, all networks are unified with a 1D convolutional neural network.
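The remark about time shift can be checked with a small numerical sketch: the magnitude of the discrete Fourier transform is unchanged when a signal is circularly shifted within its window. The code below uses a naive DFT from the standard library, which computes the same transform as an FFT, and a made-up two-tone signal; the signal, window length, and shift are assumptions, and whether the paper's pipeline uses magnitudes only is likewise assumed.

```python
import cmath
import math


def dft_magnitude(signal):
    """Magnitude spectrum via a naive O(n^2) DFT (the transform an FFT computes)."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff))
    return spectrum


def circular_shift(signal, shift):
    """Delay the signal by `shift` samples, wrapping around the window."""
    shift %= len(signal)
    return signal[-shift:] + signal[:-shift]


# Hypothetical test signal: two sinusoids sampled over a 32-point window.
n = 32
signal = [math.sin(2 * math.pi * 3 * t / n)
          + 0.5 * math.cos(2 * math.pi * 7 * t / n)
          for t in range(n)]
shifted = circular_shift(signal, 5)

original_mag = dft_magnitude(signal)
shifted_mag = dft_magnitude(shifted)
max_diff = max(abs(a - b) for a, b in zip(original_mag, shifted_mag))
print(f"largest magnitude difference after shifting: {max_diff:.2e}")
```

The invariance is exact only for a circular shift. For a window that slides along a longer recording, the magnitude spectra agree approximately rather than exactly.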

• A metric-based, few-shot meta-learning framework is designed for bearing fault diagnosis, which is more suitable for a few-shot transfer scenario from the experimental situation to the actual working situation;
• A comparison between the designed few-shot meta-learning method and the fine-tuning-based transfer learning method is performed, showing that the proposed method performs better when data are extremely scarce. The proposed method is 5% more accurate than the conventional transfer learning method and 65% more accurate than the conventional statistical method in extremely few-shot scenarios;
• The label smoothing regularization method and the Adabound optimizer can inhibit overfitting when learning from small samples. The Adabound optimizer can help the model learn the data features more quickly, and reduces model training by up to 20 episodes.
In future work, we will further improve the stability and interpretability of the model, and reduce the number of hyper-parameters. In recent years, the few-shot meta learning method has been increasingly applied in fault diagnosis, which is a potential method for bearing fault diagnosis with few samples and condition transfers. The effectiveness of the proposed framework makes it a promising method for fault diagnosis.