Few-Shot Learning for Fault Diagnosis: Semi-Supervised Prototypical Network with Pseudo-Labels

Abstract: Deep learning-based bearing fault diagnosis relies heavily on large numbers of labeled training samples. However, in real industrial applications, labeled data are scarce or even impossible to obtain. In this study, we addressed a challenging few-shot bearing fault diagnosis problem in which few or no labeled training samples of novel categories are available. To tackle this problem, we designed a semi-supervised prototypical network with pseudo-labels for few-shot bearing fault diagnosis. Existing prototypical networks with pseudo-labels train a pseudo-label model on high-dimensional labeled data to label unlabeled samples, which cannot eliminate the instability of the pseudo-label model caused by high-dimensional labeled features. To mitigate this issue, we used kernel principal component analysis to reduce the dimensions of, and remove redundant information from, the high-dimensional data. Specifically, we used a pseudo-label prediction algorithm with probability distance to label unlabeled samples, aiming to improve the labeling accuracy. We used two well-known bearing datasets for the validation experiments with symmetry parameters. The findings showed that the classification accuracy of the proposed method is higher than that of other existing methods.


Introduction
Rotating machinery is an important component of smart manufacturing, and its healthy and stable operation is required to guarantee production. However, because bearings operate for long periods in harsh environments, they can easily fail, leading to disastrous consequences [1,2]. To ensure the safety and efficiency of smart manufacturing, rolling bearing fault diagnosis needs further study and has increasingly attracted research attention [3,4].
Benefitting from the rapid development of computer and sensing technologies, industry has entered the era of big data. Owing to its ability to learn from big data, deep learning has replaced shallow models and has been successfully applied in various fields [5,6,7]. With continuous development and improvement, deep-learning models have been widely applied in the field of fault diagnosis. Gong et al. [8] used an improved convolutional neural network support vector machine (CNN-SVM) method to effectively identify incipient faults in rotating machinery. Jiang et al. [9] explored a deep recurrent neural network (DRNN) to automatically extract features from input spectrum sequences. Cui et al. [10] proposed a feature distance stack autoencoder (FD-SAE) for rolling bearing fault diagnosis to improve the feature extraction ability and the convergence speed of the network. Zhang et al. [11] combined an ensemble deep belief network and variational mode decomposition to improve the accuracy and stability of the diagnosis of the health status of rotating machinery. These deep learning-based methods can produce accurate results; however, a large amount of labeled data is required to obtain an effective deep model, and obtaining enough labeled samples in actual industrial applications is difficult and time-consuming. Therefore, more effort is needed to apply few-shot learning to the problems encountered in the practical application of deep-learning methods.
Few-shot learning, a technique of learning from a few labeled samples to automatically classify massive amounts of unlabeled samples, has recently attracted attention. Zhang et al. [12] proposed a few-shot learning approach for fault diagnosis under limited data conditions based on CNNs and Siamese neural networks. Jiang et al. [13] embedded a two-branch network into the prototype network, building a two-branch prototype network fault diagnosis method to mitigate the few-shot sample classification problem. Xu et al. [14] combined K-nearest neighbor with cosine distance to build a distribution discrepancy metric and developed a deep convolutional nearest neighbor matching network for few-shot learning. Wang et al. [15] developed a feature space metric-based meta-learning model that addresses the few-shot learning problem under limited labeled samples by adopting both individual sample information and similar sample group information. Xu et al. [16] designed a few-shot learning method for fault diagnosis based on approximation space and belief functions. They used the basic probability assignment calculation to build belief functions for diagnosis with insufficient information.
Although these few-shot learning fault diagnosis methods achieve encouraging performance when few labeled samples are available, they ignore the massive unlabeled data that exist in practical industrial applications. Once assigned pseudo-labels, these unlabeled data may be combined with the few labeled samples to effectively train deep-learning models.
Semi-supervised few-shot learning, as a promising method when labeled samples are scarce, is increasingly receiving research attention. Tao et al. [17] designed a bearing defect diagnosis model using pseudo-labels, which obtained representative features for classification from unlabeled samples. Zhang et al. [18] used a Monte Carlo uncertainty threshold selection strategy to increase the confidence of the pseudo-labels, then used a momentum prototype network to obtain the feature space mapping using few labeled samples. Yong et al. [19] used an encoder to extract features for training prototypes, then applied semi-supervised meta-learning, optimized by a combinatorial learning optimizer, to refine the original prototypes using unlabeled samples. Kai et al. [20] explored a pseudo-loss confidence metric for task-unified confidence estimation by mapping the different pseudo-labels to the same metric space using the pseudo-loss. Di et al. [21] combined the learning of latent representations with cluster structures and proposed a pseudo-label-guided collective matrix factorization method for multi-view clustering. These semi-supervised few-shot learning methods have recently been widely studied, and encouraging results have been obtained. However, most of them focus on the confidence estimation of pseudo-label learning, which suffers when the samples in a single task are insufficient. They also ignore the redundant information embedded in the feature space, which can cause errors in pseudo-label learning models and considerably decrease the generalization capability of a model.
Motivated by the aforementioned analysis, and considering the influence of the redundant information in high-dimensional feature space, we designed a kernel principal component analysis-based semi-supervised prototypical network with pseudo-labels for fault diagnosis. We used kernel principal component analysis to reduce the dimensions of the feature space, which mitigates the effects of redundant information and results in a lightweight training model. We used the pretrained model, whose parameters we obtained by training on the dimension-reduced features, to predict the labels of unlabeled data, and we selected the reliable labels as the pseudo-labels. We fed the pseudo-labeled data to the pretrained prototype network to further fine-tune the parameters and produce a prototype network with strong generalization ability. Finally, we conducted comparison experiments on two bearing datasets, including the well-known Case Western Reserve University (CWRU) dataset, to demonstrate the effectiveness of our method. The main highlights of the study are as follows:
(1) We used kernel principal component analysis to reduce the dimensions of the feature space, which prevents redundant information embedded in the feature space from reducing the generalization ability of the model;
(2) We used a pseudo-label prediction algorithm to generate labeled samples, which increases the number of labeled samples and makes full use of the unlabeled samples for training the prototype networks to avoid overfitting;
(3) We adopted the predicted pseudo-labeled data to fine-tune the prototype network parameters, which reduces the time required for adjusting the model parameters and improves diagnostic accuracy.

Few-Shot Learning
Few-shot learning [22], which aims to learn about a new category from a small amount of labeled data, has aroused increased interest in the pattern recognition community. This technique has been extensively applied in artificial intelligence.
In the few-shot learning problem, the dataset is divided into three parts: the support set $S = \{(x_i, y_i)\}_{i=1}^{N \times K}$, the query set $Q$, and the test set $T$. The support and query sets are drawn from the same classes but contain different samples, whereas the sample categories of the test set differ from those of the support set. Here, $x_i \in \mathbb{R}^D$ denotes the feature vector extracted from a raw vibration signal, and $y_i \in \{1, 2, \ldots, C\}$ denotes the label of the sample. In the traditional method, the dataset is only divided into a training set and a test set. In contrast, in few-shot learning the support set $S$ and query set $Q$ are used to train the network, and the test set $T$ is used to evaluate the performance of the network, which improves the stability and generalization of the model. If the support set contains N classes with K samples per class, the task is described as an N-way K-shot problem. The process of few-shot learning is illustrated in Figure 1.
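The N-way K-shot split described above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' implementation: the function name `sample_episode`, the parameter `n_query` (query samples per class), and the toy label list are all hypothetical.

```python
import random
from collections import defaultdict


def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Return (support, query) index lists for one N-way K-shot episode.

    Hypothetical helper: `labels` holds one class label per sample, and
    `n_query` is the number of query samples drawn per class.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Pick N classes, then K support + n_query query samples from each class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)
        support.extend(picked[:k_shot])   # same classes as the query set ...
        query.extend(picked[k_shot:])     # ... but different samples
    return support, query


# Toy data: 10 classes with 20 samples each.
labels = [c for c in range(10) for _ in range(20)]
support_idx, query_idx = sample_episode(
    labels, n_way=5, k_shot=1, n_query=3, rng=random.Random(0)
)
```

Here a 5-way 1-shot episode yields 5 support indices and 15 query indices, and no sample appears in both sets.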

Prototypical Network
Prototypical networks [23,24] generalize to new classes not included in the training set given only a small number of examples of each new class. As a metric-based approach, they have been widely used in few-shot learning, producing impressive results. An embedding function is learned by a neural network; then, samples are mapped into feature vectors. The mean vector of each class is its prototype. During classification, query samples are first transformed into feature vectors; the distance from a feature vector to each prototype represents the similarity to the corresponding class. Figure 2 describes the working of the prototype networks.
In few-shot learning, a support set of N labeled samples $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ is given, where each $x_i \in \mathbb{R}^D$ is a D-dimensional feature vector, and $y_i \in \{1, \ldots, K\}$ is the label of $x_i$. $S_k$ denotes the set of samples of class k in the support set. Through an embedding function $f_\phi: \mathbb{R}^D \to \mathbb{R}^M$, prototype $p_k$ is computed as follows:
$$p_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i) \tag{1}$$
During classification, the distance is calculated by the distance function $d(\cdot)$; the probability that query point $x$ belongs to class $k$ can be expressed as:
$$P(y = k \mid x) = \frac{\exp\left(-d\left(f_\phi(x), p_k\right)\right)}{\sum_{k'} \exp\left(-d\left(f_\phi(x), p_{k'}\right)\right)} \tag{2}$$
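The prototype and class-probability computations can be sketched in a few lines of NumPy. This is an illustrative sketch only: it assumes the samples have already been passed through the embedding function (so the arrays hold embedded features), uses the Euclidean distance, and the function names and toy values are hypothetical.

```python
import numpy as np


def compute_prototypes(support_feats, support_labels):
    """Prototype of each class = mean of its embedded support samples."""
    classes = np.unique(support_labels)
    protos = np.stack(
        [support_feats[support_labels == c].mean(axis=0) for c in classes]
    )
    return classes, protos


def class_probabilities(query_feats, protos):
    """Softmax over negative Euclidean distances to the prototypes."""
    dists = np.linalg.norm(
        query_feats[:, None, :] - protos[None, :, :], axis=2
    )
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy embedded features: two classes, two support samples each.
support_feats = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]])
support_labels = np.array([0, 0, 1, 1])
query_feats = np.array([[0.0, 1.0], [5.0, 5.0]])

classes, protos = compute_prototypes(support_feats, support_labels)
probs = class_probabilities(query_feats, protos)
predicted = classes[probs.argmax(axis=1)]
```

In this toy case the prototypes are (0, 1) and (5, 4), and each query sample is assigned to the class of its nearest prototype.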

KPCA
KPCA is a nonlinear extension of PCA that can be solved as an eigenvalue problem of its kernel matrix [25]. Samples are nonlinearly mapped into a higher-dimensional feature space F, and PCA is performed there.
Let samples $x_1, \ldots, x_N \in \mathbb{R}^D$ be mapped into $\varphi(x_1), \ldots, \varphi(x_N) \in F$. The covariance matrix $C$ in the feature space F is given by:
$$C = \frac{1}{N} \sum_{k=1}^{N} \varphi(x_k)\,\varphi(x_k)^{T} \tag{3}$$
where $\sum_{k=1}^{N} \varphi(x_k) = 0$. The non-zero eigenvalues $\lambda$ of the covariance matrix $C$ can be calculated as:
$$\lambda v = C v \tag{4}$$
where $v$ denotes the corresponding eigenvector in F, which can also be written as:
$$v = \sum_{k=1}^{N} l_k\, \varphi(x_k) \tag{5}$$
The problem is thus reduced to finding the coefficients $l_k$. Substituting Equations (3) and (5) into (4) yields the following eigenvalue problem:
$$N \lambda\, l = K l, \quad l = (l_1, \ldots, l_N)^{T} \tag{6}$$
where $K$ is a kernel matrix of size N × N, which can be calculated as follows:
$$K_{ij} = k(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle \tag{7}$$
$k(\cdot)$ denotes the kernel function, which is used to calculate the inner product of $\varphi(x_i)$ and $\varphi(x_j)$. In this study, we used the RBF kernel function:
$$k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^{2}}{2\sigma^{2}}\right) \tag{8}$$
where $\sigma$ is set to 1/N. Let $\lambda_l$ be the lth largest eigenvalue of $K$, and $\alpha^{l} = (\alpha^{l}_{1}, \ldots, \alpha^{l}_{N})^{T}$ be the corresponding eigenvector. An input sample $x$ can be mapped onto the lth dimension of KPCA space with coordinate value:
$$\langle v^{l}, \varphi(x) \rangle = \sum_{i=1}^{N} \alpha^{l}_{i}\, k(x_i, x) \tag{9}$$
The advantage of kernel principal component analysis is that only the kernel function needs to be calculated in the original space; the nonlinear mapping function $\varphi(x)$ does not need to be known.
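The KPCA projection can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the RBF kernel is taken in the common form exp(-||x - y||² / (2σ²)), the kernel matrix is explicitly centered (the usual way to enforce the zero-mean condition on the mapped samples), and the function names, random toy data, and the value `sigma=2.0` are hypothetical choices for illustration.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between the rows of A and the rows of B (assumed form)."""
    sq_dists = (
        (A ** 2).sum(axis=1)[:, None]
        + (B ** 2).sum(axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def kpca_fit_transform(X, n_components, sigma):
    """Project the training samples onto the leading kernel principal components."""
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)

    # Center the kernel matrix so the mapped samples have zero mean in F.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ Kc_placeholder(K, one_n)

    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam, alpha = eigvals[order], eigvecs[:, order]

    # Rescale coefficients so each feature-space eigenvector has unit norm.
    alpha = alpha / np.sqrt(lam)
    return Kc @ alpha


def Kc_placeholder(K, one_n):
    """Helper for the last centering term: K @ one_n."""
    return K @ one_n


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))   # toy data: 30 samples, 8 features
Z = kpca_fit_transform(X, n_components=2, sigma=2.0)
```

The resulting columns of `Z` are zero-mean and mutually orthogonal, with the first component carrying at least as much variance as the second. Only kernel evaluations are needed; the mapping into the feature space is never computed explicitly.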

Metric and Query
In this study, we used the Euclidean distance to calculate the similarity of samples through the feature vectors extracted from these samples. The distance can be represented as:
$$d_f(x_i^q, p_n) = \left\| f(x_i^q) - p_n \right\|_2 \tag{10}$$
where $p_n$ denotes the prototype of class n, $x_i^q$ denotes the ith sample in the query set, and $f(\cdot)$ is the feature vector extracted from the raw vibration signal. The smaller the distance $d_f$, the more similar the query sample is to this class. The probability of query sample $x_i^q$ belonging to class k can be described as:
$$P(y = k \mid x_i^q) = \frac{\exp\left(-d_f(x_i^q, p_k)\right)}{\sum_{n} \exp\left(-d_f(x_i^q, p_n)\right)} \tag{11}$$
In the process of pseudo-label learning, to retain samples with good classification performance, we verified experimentally that, after SoftMax screening, the samples whose probability value was greater than 0.7 had high classification accuracy. The detailed experiment on probability P is illustrated in Figure 3.
Then, the loss function of the samples selected for the query set was designed as:
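The probability screening and the loss on the selected samples can be sketched as follows. This is a hedged illustration: the 0.7 threshold comes from the text, but the negative log-likelihood used here is the standard prototypical-network loss and is an assumption, not necessarily the exact loss designed in this study. The function names and the toy probabilities are hypothetical.

```python
import numpy as np


def select_pseudo_labels(probs, threshold=0.7):
    """Keep samples whose top class probability exceeds the threshold."""
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence > threshold)
    return keep, probs[keep].argmax(axis=1)


def nll_loss(probs, labels):
    """Mean negative log-probability of the given labels (assumed loss form)."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())


# Toy SoftMax outputs for three unlabeled samples over three classes.
probs = np.array([
    [0.90, 0.05, 0.05],   # confident  -> kept, pseudo-label 0
    [0.40, 0.35, 0.25],   # uncertain  -> discarded
    [0.10, 0.75, 0.15],   # confident  -> kept, pseudo-label 1
])
keep, pseudo = select_pseudo_labels(probs)
loss = nll_loss(probs[keep], pseudo)
```

Only the first and third samples pass the 0.7 screen, so only they contribute pseudo-labels and loss terms.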

Description of Proposed Method
In this study, we designed a kernel principal component analysis-based semi-supervised prototypical network (PSSPN). The whole algorithm is described in Algorithm 1.

Algorithm 1
Preprocess the raw data with KPCA.
Randomly sample N classes from dataset $D_L$; the K samples drawn from each class form support set $S_e$. Similarly, randomly sample Q samples to create query set $Q_e$.
Obtain samples $x_i^S$ and $x_i^Q$ from support set $S_e$ and query set $Q_e$, respectively, and generate support feature set $f_\phi(x_i^S)$ and query feature set $f_\phi(x_i^Q)$.
Calculate the classification probability $P(y = k \mid x_i^q)$ by Equation (2).
Calculate the loss, and update parameter $\phi$.
Use the model pretrained in the preceding steps to predict the labels of $D_U$; after selection, we obtain the pseudo-labeled dataset $D_{pseudo}$.
Fine-tune parameter $\phi$ with datasets $D_{pseudo}$ and $D_L$.
End
Step (1)-Few-shot learning: the dimensions of the labeled data feature space are reduced by KPCA, and the reduced-dimension features are fed into the prototypical network. The distances between the query samples and the prototypes are calculated and then converted into probability values using a SoftMax classifier. The loss is calculated and the network parameters are updated. After iterating, the pretrained model is obtained.
Step (2)-Pseudo-labeling: the unlabeled samples are preprocessed by KPCA, and the pretrained model predicts their labels. After selection, only part of the predicted labels are retained, yielding the pseudo-labeled data.
Step (3)-Fine-tuning: the pseudo-labeled samples are added to the labeled sample set to train and fine-tune the relevant parameters of the prototypical network.
A detailed description of the workflow of the proposed method is provided in Figure 4.
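The three-step workflow (pretrain, pseudo-label, fine-tune) can be illustrated with a deliberately simplified sketch. Several simplifications are assumptions made for brevity and are not part of the proposed method: there is no neural embedding and no KPCA step, "pretraining" and "fine-tuning" are both reduced to computing class-mean prototypes, and the two-class synthetic data, function names, and noise scale are hypothetical.

```python
import numpy as np


def fit_prototypes(X, y, n_classes):
    """Stand-in for training: class-mean prototypes of the given features."""
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(X, protos):
    """Softmax over negative Euclidean distances to the prototypes."""
    d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    e = np.exp(-(d - d.min(axis=1, keepdims=True)))
    return e / e.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 6.0]])

# One labeled sample per class, plus 40 unlabeled samples.
y_lab = np.array([0, 1])
X_lab = centers[y_lab] + rng.normal(scale=0.5, size=(2, 2))
y_unl_true = rng.integers(0, 2, size=40)   # hidden, used only for checking
X_unl = centers[y_unl_true] + rng.normal(scale=0.5, size=(40, 2))

# Step (1): "pretrain" on the few labeled samples.
protos = fit_prototypes(X_lab, y_lab, n_classes=2)

# Step (2): predict the unlabeled samples and keep confident predictions.
probs = predict_proba(X_unl, protos)
keep = probs.max(axis=1) > 0.7
pseudo_labels = probs[keep].argmax(axis=1)

# Step (3): "fine-tune" using labeled plus pseudo-labeled samples.
X_aug = np.vstack([X_lab, X_unl[keep]])
y_aug = np.concatenate([y_lab, pseudo_labels])
protos_refined = fit_prototypes(X_aug, y_aug, n_classes=2)
```

Because the refined prototypes average over many pseudo-labeled samples instead of a single labeled one, they are less sensitive to the noise in any individual labeled sample, which is the intuition behind the fine-tuning step.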

Results and Discussion
We used three methods, CNN, ProtoNet [24], and improved prototype network (IPN) [26], for a comparison experiment to verify the validity of the proposed method. The feature extractors of CNN and ProtoNet have the same network structure. IPN uses L2 regularization and a dropout layer, which help to address the model overfitting problem. The architecture of PSSPN is described in detail in Table 1. We used leaky ReLU as the activation function, with α set to 0.3. The network was optimized by the Adam optimizer with a learning rate of 0.001. We repeated the experiment 20 times to obtain the final accuracy. We used TensorFlow 2.0 to conduct the experiment.
The first dataset was the CWRU bearing dataset [27], which was collected under four different working conditions and loads (0, 1, 2, and 3 hp); the motor worked at speeds of 1797, 1772, 1750, and 1730 rpm, respectively. Each working condition had four bearing health conditions: normal, ball fault, inner race fault, and outer race fault. Each fault type contained three fault sizes: 0.007, 0.014, and 0.021 inches. We provide details about the dataset in Table 2, which shows that there were 10 types in total. For the detailed experimental platform, please refer to the related references. We generated all the training and testing samples using a sliding window, with the window length set to 1024 and the step length of the framing set to 80. The detailed information about the dataset used in the experiment is provided in Table 3.

Table 3. Number of samples per class in the CWRU experiment (the same for each of the 10 classes).
Subset       Samples per class
Pretrain     500
Unlabeled    1000
Test         200
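The sliding-window segmentation (window length 1024, step 80) can be sketched as below. The function name and the synthetic 2000-point signal are hypothetical; only the window and step values come from the text.

```python
def sliding_windows(signal, window=1024, step=80):
    """Cut a 1-D signal into overlapping segments of `window` points."""
    last_start = len(signal) - window
    return [signal[s:s + window] for s in range(0, last_start + 1, step)]


# Toy signal of 2000 points; a real vibration record would be much longer.
signal = list(range(2000))
segments = sliding_windows(signal)
```

A signal of length L yields floor((L - 1024) / 80) + 1 segments when L is at least 1024, so the 2000-point toy signal gives 13 samples. The small step relative to the window is what allows many training samples to be drawn from a limited recording.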

Results Analysis
In this case study, 1-shot and 5-shot experiments were conducted on the dataset described in Table 3. All parameters of the compared methods are as mentioned above, and the experimental results are listed in Table 4. Table 4 shows that the classification accuracies of the few-shot learning methods are much higher than that of the conventional CNN. For 1-, 5-, and 10-shot learning, PSSPN achieved 89.72%, 94.65%, and 97.05% accuracies, respectively, which are higher than those of the other considered methods. The classification accuracy of ProtoNet with KPCA is higher than that of ProtoNet. This finding shows that KPCA helps remove redundant information, which causes errors in the model during training.
Figure 5 shows the confusion matrix for the five-shot classification results of PSSPN. We found that the accuracy of PSSPN was high for the various bearing fault types. However, for label 8, the classification accuracy was relatively low, and several samples were categorized into label 9. The most probable reason is that the difference between these two classes is small, which leads to misclassification. With the increase in the number of training samples, the classification accuracy of the various methods also increased, especially that of ProtoNet and IPN. The classification accuracy of PSSPN was higher than that of the other methods when the number of labeled training samples was small.
Figure 6 illustrates the feature visualization produced by t-SNE. For CNN, the difference between some classes was clear, but several features were inseparable. For ProtoNet and IPN, different samples were successfully distinguished, but the result was messy. The results produced by PSSPN were more clearly divided than those of the other considered methods, although the boundaries of some samples were ambiguous. Possible reasons include the amount of data being relatively small, so that the model could not be further improved, or the difference between the samples being too small for our model to distinguish.

Petrochemical Dataset Introduction
The petrochemical dataset (Guangdong Provincial Key Laboratory of Petrochemical Equipment Fault Diagnosis, Guangdong University of Petrochemical Technology, Maoming, China) [29,30] contains more noise and is related to industrial environments. We established a simulation platform to simulate the actual working environment of a petrochemical refinery and the power load of rotating machinery. Detailed information on the platform used for data collection from the machinery is provided in Figure 7. For more detailed information, please refer to the provided references.
The petrochemical dataset has six fault classes, and each sample has 1024 sampling points. In this case study, we chose 500 samples for training, 1000 unlabeled samples, and 200 labeled samples for evaluation. Detailed dataset information is listed in Table 5. The parameters of the model were the same as those used for the CWRU dataset. The classification accuracy of all considered methods is provided in Table 6. As shown in Table 6, CNN obtained the lowest classification accuracy, below that of the few-shot learning strategies. Both ProtoNet and IPN achieved accurate classification; in 10- and 30-shot learning, IPN reached 100% classification accuracy. With KPCA added, the accuracy of ProtoNet improved slightly. For one- and five-shot learning, PSSPN performed the best, showing that the model can deal with situations in which data are scarce.
The confusion matrix of five-shot classification accuracy is shown in Figure 8. Most samples were correctly classified, and the average classification accuracy was 96.2%.
Figure 9 shows the feature visualization via t-SNE. CNN could not clearly distinguish different fault classes. ProtoNet and IPN performed better than CNN: the distance between features of different classes was large, and the features in the same class were close to each other. However, some features from different classes were still close to each other. Our proposed method separated the classes more clearly than the others in the t-SNE visualization; in our method, different classes have well-defined boundaries.

Conclusions
We presented a kernel principal component analysis-based semi-supervised prototype network (PSSPN) for few-shot bearing fault diagnosis. This method can be used when few labeled samples are available, and it makes full use of the unlabeled data to train the model. KPCA is used to reduce the feature dimensions and remove redundant information, which improves the accuracy of the classification results. We used pseudo-labeled data to fine-tune the pretrained model to avoid model overfitting. We used two datasets to evaluate the performance of the proposed method, and the results showed that, compared with the other considered methods, the classification accuracy of our proposed method is higher when few labeled samples are available. In the future, we will improve the model to deal with data that are difficult to distinguish and to increase the accuracy of the classification results.