Domain-Adaptive Prototype-Recalibrated Network with Transductive Learning Paradigm for Intelligent Fault Diagnosis under Various Limited Data Conditions

In real industrial scenarios, intelligent fault diagnosis based on data-driven methods has been widely researched in the past decade. However, data scarcity is widespread in fault diagnosis tasks owing to the difficulty of collecting adequate data. As a result, there is an increasing demand from both researchers and engineers for fault identification with scarce data. To address this issue, an innovative domain-adaptive prototype-recalibrated network (DAPRN) based on a transductive learning paradigm and a prototype recalibration strategy (PRS) is proposed, which has the potential to promote generalization from the source domain to the target domain in few-shot fault diagnosis. Within this scheme, the DAPRN is composed of a feature extractor, a domain discriminator, and a label predictor. Concretely, the feature extractor is jointly optimized by minimizing the few-shot classification loss and maximizing the domain-discriminative loss. The cosine similarity-based label predictor, which is enhanced by the PRS, is exploited to avoid the bias of naïve prototypes in the metric space and to recognize the health conditions of machinery in the meta-testing process. The efficacy and advantage of DAPRN are validated by extensive experiments on bearing and gearbox datasets in comparison with seven popular and well-established few-shot fault diagnosis methods. In practical application, the proposed DAPRN is expected to handle more challenging few-shot fault diagnosis scenarios and to facilitate practical fault identification in modern manufacturing.


Introduction
With the advancement of the Industrial Internet of Things (IIoT), modern manufacturing based on the complex industrial system is progressing towards high-level precision and speed, accompanied by high demand for reliability and health management [1]. In engineering practice, any unexpected failure of large industrial mechanical equipment may lead to serious casualties and economic losses [2,3]. Hence, accurate fault diagnosis, which is of great importance for equipment health management, has received more and more attention from both academia and industry.
Recently, a lot of intelligent fault diagnosis methods based on deep learning (DL) models which can automatically recognize health conditions of machinery in an end-to-end manner have made major advances, such as sparse auto-encoder (SAE) [4], convolutional neural network (CNN) [5,6], generative adversarial network (GAN) [7,8], graph neural network (GNN) [9], long short-term memory network (LSTM) [10], and deep belief network (DBN) [11]. In general, DL-based fault diagnosis models directly learn the mapping between the training dataset with available labels and recognize the unseen health conditions in the testing dataset, and the training and testing datasets are implicitly in the same distribution. To solve the cross-domain fault identification, transfer learning (TL), which leverages learned diagnostic knowledge from similar and relative tasks in the source domain and identifies the health conditions of the target domain, has been developed in recent years [12][13][14][15][16][17][18]. In the TL scheme, the health conditions of source and target domains are generally the same, except for some partial TL methods [19][20][21][22] and open set TL methods [23,24]. Moreover, the TL scheme is also data-greedy, similar to the DL scheme. Based on the above analysis, both traditional DL and TL have been successfully implemented in data-intensive applications but they are usually hampered when data are limited. In the field of fault diagnosis, the original data are collected and classified by faulty types and operating conditions. In modern production, it is apparent that machinery does not always work under high load and speed, except in emergencies. Additionally, the fault identification and health management system does not allow machines to work in faulty states; only limited original data with supervised labels are collected in practice. Unfortunately, these diagnostic tasks with data-scarce cases are common in real-world modern machinery. 
Consequently, intelligent fault diagnosis with scarce data is a scenario with ever-increasing demand in practical engineering.
Few-shot learning (FSL) [16], which aims to train a classifier that can diagnose health conditions unseen during the training process using limited labeled data, is a promising tool to solve the above challenge. Motivated by the strong generalization ability of humans, who can perform even one-shot classification, FSL has been widely explored in several fields [17], for instance, image classification [18], intrusion detection [19], disease prediction [20], and fault diagnosis [21]. To recognize the health conditions of bearings, Zhang et al. [22] proposed a few-shot diagnosis model based on the Siamese neural network (SiameseNet), which learns similarity-based information in feature space from input sample pairs of the same or different classes. Ren et al. [1] presented a novel few-shot model composed of an auto-encoder and a capsule network, which shows high performance on few-shot fault diagnosis tasks. Inspired by model-agnostic meta-learning (MAML), a few-shot learning framework whose fault classifier is trained with limited target data was proposed for bearing fault diagnosis [23]; the MAML-based framework was further validated by recognizing new bearing damages in a public dataset. Combining supervised domain adaptation and prototype learning, a Siamese architecture was proposed to learn a domain-invariant space for bearing fault identification with limited data [24]. As a variation of the metric-based meta-learning method, Wang et al. [25] presented a reinforced relation network for rolling element bearing fault identification with few-shot samples. Based on the deep metric of the relation network (RelationNet) [26], Wu et al. [27] proposed a unified convolutional neural network for few-shot diagnosis tasks for machinery.
Further, a metric-based meta-learning model with global supervised training and episodic training in feature space, which is extended from the matching network (MatchingNet) [28] and prototypical network (ProtoNet) [29], is proposed for fault diagnosis of machinery under various limited data conditions [21].
From this literature review, it can be concluded that existing networks with few-shot generalization ability have made great progress in fault diagnosis with limited data in real industrial scenarios. Two main branches of FSL, i.e., optimization-based meta-learning [23,30] and metric-based meta-learning [21,24], have been applied in the intelligent fault diagnosis of machinery. Moreover, the metric-based meta-learning methods show great potential in the field of few-shot fault diagnosis. Despite the various designs of network architecture in metric-based meta-learning methods, these methods are all composed of a network-based feature extractor for high-level representation extraction and a similarity-based classifier for fault diagnosis with unseen health conditions. During the meta-learning phase, episodic training randomly samples mini-batches as episodes from the source dataset with base classes to mimic few-shot tasks, so that the learned diagnostic knowledge generalizes to few-shot fault diagnosis tasks in the target dataset with novel classes. Nonetheless, there are still two drawbacks to the aforementioned studies. (1) Most existing few-shot diagnosis methods make the implicit hypothesis that the source and target data are from the same distribution. However, this hypothesis is not valid due to the large differences in the working conditions of machinery in real industrial scenarios. (2) Metric-based meta-learning methods must identify faults of machinery with limited labeled samples in practical engineering. However, the narrow target distribution, which relies heavily on scarce data from the target domain, tends to yield a biased similarity-based classifier.
To solve the above-mentioned issues, a novel FSL method named domain-adaptive prototype-recalibrated network (DAPRN) with transductive learning paradigm and prototype recalibration strategy (PRS) is proposed for intelligent fault diagnosis of machinery under various limited data conditions. The proposed DAPRN, which is inspired by the idea of ProtoNet, is composed of a feature extractor, a domain discriminator, and a cosine similarity-based label predictor. Integrating the domain adaptation into the meta-training process, the feature extractor of DAPRN learns a domain-invariant representation space between the source and target domains through joint optimization of minimizing the few-shot classification loss and maximizing the domain-discriminative loss. The prototypes, which denote the representations of health conditions in the latent metric space, are recalibrated by the PRS to reduce the bias of naïve prototypes and promote few-shot fault diagnosis. Hence, the trained feature extractor and label predictor are capable of identifying the health condition of machinery in the meta-testing process.
The contributions of the work can be summarized as follows.
(1) To address fault diagnosis with limited data, an innovative end-to-end DAPRN which is made up of a feature extractor, a domain discriminator, and a label predictor is presented. In the training process, the feature extractor learns a representation space by a hybrid training strategy combined with the minimization of few-shot classification loss and the maximization of domain-discriminative loss. In the testing process, the label predictor with recalibrated prototypes can recognize the health conditions of target samples using the generalized meta-knowledge of source diagnostic tasks.
(2) The structure of the feature extractor is appropriately discussed. In addition, to explore whether and how the data capacity and category richness of the source dataset regulate the performance of few-shot fault diagnosis, a series of experiments are designed and carried out. The details of experimental results are analyzed and discussed thoroughly.
(3) To test the validity and superiority of the DAPRN, extensive few-shot fault diagnosis tasks of rolling element bearings and planetary gearboxes under various limited data conditions are conducted. Compared with existing popular FSL methods, in-depth quantitative and qualitative analysis convincingly demonstrates that the proposed method significantly improves the performance of the few-shot fault diagnosis. In addition, ablation studies are implemented to further verify the advance of the proposed method.
The rest of this paper is structured as follows. The background knowledge, including few-shot learning for fault diagnosis, the prototypical network based on deep metric learning, and the transductive few-shot learning paradigm, is provided in Section 2. Detailed descriptions of the proposed DAPRN are given in Section 3. Section 4 provides an in-depth discussion on multiple few-shot tasks. Finally, Section 5 concludes this research and proposes prospects for future work.

Few-Shot Learning for Fault Diagnosis
The workflows of traditional DL, TL, and FSL are demonstrated in Figure 1. As shown in Figure 1a, a supervised fault diagnosis task is usually composed of a training dataset (source domain) with labeled data for training the traditional DL models and a testing dataset (target domain) for model evaluation. It should be pointed out that the training and testing datasets are drawn from one probability distribution. As shown in Figure 1b, a cross-domain fault diagnosis task consists of a source dataset with labeled data and a target dataset with unlabeled data, and these two datasets are collected from different probability distributions. The TL models, which are trained on the source and target domains employing unsupervised domain adaptation, can correctly recognize samples of the unlabeled target domain. As shown in Figure 1c, the traditional fault diagnosis task is termed a few-shot task when the training dataset contains limited samples with supervised labels. Unlike traditional DL and TL, the source dataset of FSL, whose label space is disjoint from that of the target domain, is named the base or auxiliary set and contains a large amount of labeled data. Note that the training dataset (i.e., support set) with scarce annotated data and the testing dataset (i.e., query set) are sampled from the target domain and share the same label space. For clarity, the differences between the above three methods are summarized in Table 1.
Referring to few-shot classification in computer vision [25,26], the detailed definition of few-shot fault diagnosis is given as follows. Meta-learning, which is motivated by the learning process of humans and is also known as "learning to learn", aims to obtain the ability to learn meta-knowledge on higher-level tasks by episodic training. Specifically, there are three datasets in the few-shot fault diagnosis problem: a base set (denoted as B), a support set (denoted as S), and a query set (denoted as Q).
For an $N_T$-way $K$-shot $M$-test few-shot fault diagnosis task (denoted as $T_T$) from the target domain, the support set $S^T$ consists of $K$ samples per class, and the query set $Q^T$ consists of $M$ unlabeled examples per class. Note that the support set $S^T = \{(x_1^T, y_1), \ldots, (x_{M_T}^T, y_{M_T})\}$ is composed of $M_T$ labeled samples, where each $x_i^T \in \mathbb{R}^D$ is the $D$-dimensional representation vector of the original vibration signals, $y_i \in \{1, 2, \ldots, N_T\}$ is the corresponding health condition, and $M_T$ is equal to $N_T \times K$. The base set $B = \{(x_1^S, y_1), \ldots, (x_{M_S}^S, y_{M_S})\}$ consists of $M_S$ labeled samples in the source domain, where each $x_i^S \in \mathbb{R}^D$ is a raw vibration signal and $y_i \in \{1, 2, \ldots, N_S\}$ is the corresponding health condition. It should be pointed out that $N_S$ is equal to or greater than $N_T$ in practice. During the meta-training phase, the FSL models randomly sample several similar source tasks $T_S$ by imitating the target task $T_T$. The appeal of episodic meta-learning is that it tackles few-shot fault diagnosis by generalizing from $T_S$ on a known source domain to $T_T$ on the target domain. During the meta-testing phase, a trained metric-based model is capable of recognizing the health conditions of the query set $Q^T$.
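The episode construction described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `sample_episode`, the toy base set of 10 classes with 20 samples each, and the chosen values of N, K, and M are assumptions made for the example, not settings taken from the paper.

```python
import random


def sample_episode(labels, n_way, k_shot, m_query, rng):
    """Return (support, query) lists of (sample_index, class) pairs."""
    # Group sample indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    # Pick N classes, then K support and M query samples from each class.
    classes = rng.sample(sorted(by_class), n_way)
    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + m_query)
        support += [(i, c) for i in picked[:k_shot]]
        query += [(i, c) for i in picked[k_shot:]]
    return support, query


# Toy base set (hypothetical sizes): 10 classes, 20 samples per class.
labels = [c for c in range(10) for _ in range(20)]
support, query = sample_episode(labels, n_way=5, k_shot=3, m_query=4,
                                rng=random.Random(0))
print(len(support), len(query))  # 15 20
```

Support and query samples are drawn without overlap within each class, which mirrors the requirement that query samples are unseen by the prototype computation.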

Prototypical Network Based on Metric Learning
The prototypical network (ProtoNet) [27], proposed by Snell et al. in 2017, is becoming one of the most typical and popular metric-based meta-learning methods for FSL applications in various fields. Employing metric learning, ProtoNet can learn a proper metric space in which recognition is conducted by distance measurement between learned prototypes of each class and samples.
Metric learning, also termed distance metric learning, aims to learn a metric function to measure the similarity among samples automatically. Generally, the metric learning methods usually learn a linear metric function that transforms the samples into representations in a shallow feature space. Owing to the limited ability of metric learning to process complex and massive data, DL-based deep metric learning, one main branch of metric learning, exploits deep architecture to obtain embedded representations for the measurement of similarity [28]. In ProtoNet, the process of deep metric learning is shown in Figure 2. The original data, such as two-dimensional (2D) images and one-dimensional (1D) signals, are fed into one deep decoding network to obtain the representation of the original data in feature space. Then, a metric (e.g., Euclidean, Cosine, Mahalanobis, and Kullback-Leibler) is utilized to make the embeddings per class closer to each other in metric space as the training continues. As shown in Figure 3, the scheme of ProtoNet consists of two main stages: prototype generation and distance measurement. As the key element of ProtoNet, representation averaging is conducted to generate prototypes for each class. The similarity between query samples and learned prototypes is measured, and the query samples can be correctly classified into a certain category.
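The two stages of ProtoNet, prototype generation and distance measurement, can be illustrated with a small NumPy sketch. The embeddings below are hand-picked two-dimensional toy vectors standing in for the output of an embedding network, and the squared Euclidean distance is used as in the original ProtoNet; the function names are hypothetical.

```python
import numpy as np


def prototypes(support_emb, support_y, n_classes):
    """Average the support embeddings of each class (prototype generation)."""
    return np.stack([support_emb[support_y == n].mean(axis=0)
                     for n in range(n_classes)])


def classify(query_emb, protos):
    """Assign each query to its nearest prototype (squared Euclidean distance)."""
    diff = query_emb[:, None, :] - protos[None, :, :]
    dist = (diff ** 2).sum(axis=-1)          # shape (num_query, n_classes)
    return dist.argmin(axis=1)


# Toy embeddings: two classes with two support samples each.
support_emb = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]])
support_y = np.array([0, 0, 1, 1])
protos = prototypes(support_emb, support_y, n_classes=2)
pred = classify(np.array([[1.0, 1.0], [5.0, 5.0]]), protos)
print(protos.tolist(), pred.tolist())  # [[0.0, 1.0], [5.0, 4.0]] [0, 1]
```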

Transductive Few-Shot Learning Paradigm
With the rapid advance in DL algorithms and computational power, inductive learning-based methods have achieved unprecedented success in traditional supervised tasks [29]. According to the inductive learning paradigm, a machine learning model, which is trained on a training dataset to generalize rules, is applied to predict a testing dataset.
In contrast to inductive learning, transductive learning-based methods encounter both the training and testing datasets during the training process [30].
Recently, a large body of research has exploited transductive inference in few-shot applications [31][32][33]. In the transductive few-shot learning paradigm, the FSL models use both the unlabeled samples in the query set and the labeled samples in the support set during the meta-training phase, rather than training only on support samples as inductive learning-based methods do. For instance, Dhillon et al. proposed a transductive few-shot baseline that improves the few-shot classification performance by minimizing the entropy of the predictions on query samples during the meta-training phase [32]. Further, the transductive relation-propagation network fully exploits the relations between the support set and query set by relation propagation [34]. A great deal of research indicates that transductive learning-based methods, which serve as an increasingly appealing approach to few-shot tasks, outperform inductive learning-based methods in many few-shot classification scenarios [29,35].

The Architecture of Proposed DAPRN
In this work, intelligent fault diagnosis is discussed in the setting where a model trained on a support set S (i.e., training set) with limited labeled data is used to predict the health conditions of an unlabeled query set Q (i.e., testing set). However, a model trained on the support set S with scarce data usually overfits and is not conducive to effective fault diagnosis on the query set Q in practical engineering. Given the commonality among machine fault diagnosis tasks, such as comparable rolling element bearings installed in different positions or in the same position under various operating conditions, it is expected that a fairly complex few-shot fault diagnosis task can be accomplished if a related labeled dataset with relatively sufficient data, also known as the base set B, is available.
The architecture of the proposed end-to-end DAPRN for few-shot fault diagnosis of machinery is shown in Figure 4. For fault diagnosis tasks with limited data, the DAPRN is fed with original 1D vibration signals and trained by the episodic training method. Note that the DAPRN, which is extended from the naïve ProtoNet, is a transductive metric-based meta-learning method. As illustrated in Figure 4, the DAPRN is composed of three main parts: a feature extractor $f_{FE}$, a domain discriminator $f_D$, and a prototype-based label predictor $f_{LP}$. Given the dimension of the original vibration signals, the feature extractor $f_{FE}$ is designed as a 1D CNN with powerful feature extraction ability for sequential data. Following [36,37], the feature extractor $f_{FE}$ consists of multiple 1D convolutional and max-pooling layers. More specifically, the convolution operation extracts local temporal features from the original vibration signals, and the max-pooling operation retains the most important features from the output of the preceding convolutional layer. The domain discriminator $f_D$, which is connected to the feature extractor $f_{FE}$, is designed with two fully-connected layers. During the training phase, the flattened representation vectors of the source and target domains are input into $f_D$. Owing to transductive domain adaptation, the learned metric space of the feature extractor $f_{FE}$ is adapted so that it generates more effective task-specific representations in the target domain during the meta-testing phase. The prototype-based label predictor $f_{LP}$ is a module based on cosine distance and prototype recalibration without trainable parameters. It is noted that the refined predictor is the trained predictor $f_{LP}$ with prototype recalibration during the training phase.

Optimization Objective Function
As illustrated in Figure 4, there are two optimization objectives in the training procedure of DAPRN. (1) For the feature extractor $f_{FE}$ with trainable parameters $\theta_{FE}$, minimizing the few-shot classification error $L_C$ of the source support-query data guides $f_{FE}$ to learn an effective metric space for few-shot fault diagnosis tasks. (2) For the domain discriminator $f_D$ with trainable parameters $\theta_D$, maximizing the domain-discriminative loss $L_D$ of data of both domains is capable of achieving marginal distribution alignment, thus further benefiting the task-specific representation extraction in the target domain.
For an $N$-way $K$-shot fault diagnosis task $T_T$, the base set $B$ is composed of a series of annotated subsets $\{D_n^S\}_{n=1}^{N_S}$ ($N_S \geq N$). At the beginning of each training epoch, $N$ out of $N_S$ categories are randomly selected to mimic $T_T$; for each selected category $n$, $K$ samples are randomly chosen from the subset $D_n^S$ as the support set, and the rest of the subset forms the query set with $M_{train}$ samples. Hence, several few-shot tasks $T_S$ with an analogous data structure are obtained in the source domain for the whole training procedure. For convenience, $X^S$ with $M_S$ samples and $X^T$ with $M_T$ samples represent all chosen samples in the source and target domains during one training epoch, respectively.
The proposed DAPRN obtains a prototype in the same way as the naïve ProtoNet. Each prototype $P_n$ is the average of the embedded representations that the feature extractor $f_{FE}: \mathbb{R}^D \to \mathbb{R}^H$ produces for the $K$ support samples of category $n$.
Different from the naïve ProtoNet, which uses Euclidean distance for metric learning, the DAPRN relies on cosine similarity, which has gained popularity in few-shot fault diagnosis of machinery [37]. Given a cosine similarity function $d_{\cos}: \mathbb{R}^H \times \mathbb{R}^H \to [-1, 1]$, the label predictor $f_{LP}$ generates a softmax-based probability $p$ over all categories for a sample $x_{Q,i}^S$ of the query set $Q^S$ in the embedding metric space, computed from the cosine similarities between the embedded sample and the prototypes. Based on the probability $p_{m,n}$ of the $m$th sample for category $n$ in an episode, the standard cross-entropy loss is adopted as the episodic classification loss $L_C$.
As shown in Figure 4, the domain discriminator $f_D$ is fed with the embedded representations $f_{FE}(x)$ with the input being $x \in X^S \cup X^T$. In this work, the source-domain and target-domain labels are set to 0 and 1, respectively. The output probability $p_d$ of the domain discriminator $f_D$ is produced by a softmax layer, and the domain-discriminative loss $L_D$ is computed as the binary cross-entropy between $p_d$ and the domain labels.
In the training process, $L_C$ is minimized to obtain a metric space for few-shot fault diagnosis, and $L_D$ is maximized to make the two task spaces as similar as possible. To make the training process feasible, a gradient reversal layer (GRL) [38], which is an identity mapping during forward propagation and reverses the gradient by multiplying it by −1 during backpropagation, is implemented to connect the feature extractor $f_{FE}$ to the domain discriminator $f_D$.
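To make the cosine-similarity label predictor and the episodic classification loss concrete, the following NumPy sketch computes softmax probabilities over cosine similarities and the mean cross-entropy for one toy episode. It assumes that no temperature or scaling factor is applied to the similarities before the softmax, which may differ from the authors' setting; the function names and toy vectors are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def episode_loss(query_emb, query_y, protos):
    """Softmax over cosine similarities, followed by mean cross-entropy."""
    sim = cosine_similarity(query_emb, protos)       # shape (M, N)
    sim = sim - sim.max(axis=1, keepdims=True)       # numerical stability
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss = -log_p[np.arange(len(query_y)), query_y].mean()
    return np.exp(log_p), loss


# Toy episode: two prototypes and two query embeddings (assumed values).
protos = np.array([[1.0, 0.0], [0.0, 1.0]])
query_emb = np.array([[2.0, 0.0], [0.0, 3.0]])
p, loss = episode_loss(query_emb, np.array([0, 1]), protos)
print(p.round(3).tolist(), round(float(loss), 4))
```

Because cosine similarity is bounded in [−1, 1], the resulting probabilities are fairly flat without a scaling factor; practical implementations often multiply the similarities by a constant before the softmax, which is a design choice outside what the text specifies.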
Taken altogether, the two optimization objectives of the DAPRN are combined into an overall objective $L$. During the training process, the feature extractor $f_{FE}$ is updated by minimizing the few-shot classification loss $L_C$ and maximizing the domain-discriminative loss $L_D$ concurrently. Meanwhile, the domain discriminator $f_D$ is optimized by minimizing the domain-discriminative loss $L_D$. Consequently, the optimization process seeks parameters $\hat{\theta}_{FE}$ and $\hat{\theta}_D$ that deliver a saddle point of the overall objective $L$. Accordingly, the trainable parameters $\theta_{FE}$ and $\theta_D$ of the DAPRN are updated by gradient steps during each training epoch, with the step size given by the learning rate. When the training procedure is finished, the prototype-based label predictor $f_{LP}$ is capable of recognizing the unlabeled samples in the target domain with a target task-specific metric space generated by the trained feature extractor $f_{FE}$.
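One way to write the objective described above is given below. It assumes the standard adversarial domain-adaptation formulation associated with the gradient reversal layer [38], a penalty weight $\mu$ on the domain-discriminative loss, and a learning-rate symbol $\eta$ introduced here for illustration; the exact form used by the authors may differ.

```latex
\begin{aligned}
L(\theta_{FE}, \theta_D) &= L_C(\theta_{FE}) - \mu\, L_D(\theta_{FE}, \theta_D), \\
\hat{\theta}_{FE} &= \arg\min_{\theta_{FE}} L(\theta_{FE}, \hat{\theta}_D), \qquad
\hat{\theta}_D = \arg\max_{\theta_D} L(\hat{\theta}_{FE}, \theta_D), \\
\theta_{FE} &\leftarrow \theta_{FE} - \eta \left( \frac{\partial L_C}{\partial \theta_{FE}} - \mu \frac{\partial L_D}{\partial \theta_{FE}} \right), \qquad
\theta_D \leftarrow \theta_D - \eta \frac{\partial L_D}{\partial \theta_D}.
\end{aligned}
```

Under this reading, maximizing $L$ over $\theta_D$ is the same as minimizing $L_D$, which matches the stated role of the domain discriminator. If the GRL multiplies the gradient by $-\mu$ in the backward pass, backpropagating the sum $L_C + L_D$ yields these updates in a single pass.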

Prototype Recalibration Strategy
During the meta-testing procedure, the unlabeled samples of the target query set $Q^T = \{x_i^T\}_{i=1}^{N_T \times M}$, where $M$ indicates the number of query samples per category, can be identified by finding the most similar prototype based on the previously trained feature extractor $f_{FE}$ and the label predictor $f_{LP}$. However, the naïve prototype, which is determined in such a limited-data regime, is not the one expected in practice. To alleviate the prototype bias caused by the narrow target distribution, a PRS is introduced to improve few-shot fault diagnosis in the meta-testing phase.
Given a target support set with $K$ samples for category $n$, the PRS first obtains the naïve prototype $P_n$ by averaging the embedded representations of these samples. A pseudo-labeling approach, which adds provisional labels to query samples according to prediction confidence, is then utilized to obtain a recalibrated prototype in a high-data regime and thereby decrease the bias of the naïve prototypes. However, simply averaging the representations generated from the support set and the pseudo-labeled query set might lead to a worse bias in practice. Therefore, a weighted strategy is carried out to reassign the contribution of each sample from the enhanced support set to the recalibrated prototypes. In detail, the top $Z$ samples with the pseudo label being category $n$ are chosen from the query set, which yields a pseudo-labeled query subset $Q_{pn}^T$. The weight $w_{i,n}$ represents the contribution of the $i$th sample from the enhanced support set $S_n^T \cup Q_{pn}^T$ to the recalibrated prototype $P_n^r$. The recalibrated prototype $P_n^r$ of category $n$ is then computed as the combination, weighted by $w_{i,n}$, of the embedded samples $x_i^T$ from $S_n^T \cup Q_{pn}^T$. Notably, the weight $w_{i,n}$ is defined by a softmax operation over the cosine distance $d_{\cos}$.
In contrast to the naïve prototype, the recalibrated prototype, which is determined in a high-data regime, is closer to the optimal prototype in practice.
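The recalibration steps above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the confidence of a pseudo label is taken to be the maximum cosine similarity to the naïve prototypes, and the softmax weights are computed from the cosine similarity between each enhanced-support embedding and the naïve prototype of its category. Both choices, as well as the function names and toy embeddings, are assumptions.

```python
import numpy as np


def cos_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T


def recalibrate(support_emb, support_y, query_emb, n_classes, z):
    # Naive prototypes: class-wise mean of the support embeddings.
    naive = np.stack([support_emb[support_y == n].mean(axis=0)
                      for n in range(n_classes)])
    sim = cos_sim(query_emb, naive)      # shape (num_query, n_classes)
    pseudo = sim.argmax(axis=1)          # provisional labels
    conf = sim.max(axis=1)               # confidence proxy (assumption)

    recal = []
    for n in range(n_classes):
        idx = np.flatnonzero(pseudo == n)
        top = idx[np.argsort(-conf[idx])[:z]]   # top-Z confident queries
        enhanced = np.concatenate([support_emb[support_y == n],
                                   query_emb[top]])
        s = cos_sim(enhanced, naive[n:n + 1])[:, 0]
        w = np.exp(s - s.max())
        w /= w.sum()                            # softmax weights, sum to 1
        recal.append(w @ enhanced)              # weighted prototype
    return naive, np.stack(recal)


# Toy 1-shot, 2-way example with four query embeddings (assumed values).
support_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
support_y = np.array([0, 1])
query_emb = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 1.0], [0.2, 0.7]])
naive, recal = recalibrate(support_emb, support_y, query_emb, n_classes=2, z=1)
print(recal.round(3).tolist())
```

Because the weights form a convex combination, each recalibrated prototype stays between the naïve prototype and the selected pseudo-labeled queries, so a single confident query shifts the prototype without replacing it.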

Transductive Training and Testing Method
For a vanilla metric-based meta-learning method for fault diagnosis, standard episodic training is capable of training a model that generalizes to novel health conditions in practical engineering. However, this conventional training method is insufficient for the proposed DAPRN. In particular, the training process of the proposed DAPRN is composed of a conventional meta-training procedure and a transductive domain adaptation, whereas the testing process is a PRS-based meta-testing approach. In the training process, the unsupervised domain adaptation via maximization of the domain-discriminative loss $L_D$ is integrated into the minimization of the few-shot classification loss $L_C$. The penalty term µ of the domain-discriminative loss $L_D$ is dynamically decreased from 1 to 0 according to µ = 2 − 2/(1 + exp(−10 × tp)), where tp denotes the training progress, which increases from 0 to 1 gradually. It is worth noting that the domain adaptation using source-domain and target-domain data makes the training procedure a transductive process. In the testing process, a PRS is introduced to address the bias of the naïve prototype and generate a recalibrated prototype for few-shot fault diagnosis problems.
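The penalty schedule µ = 2 − 2/(1 + exp(−10 × tp)) can be evaluated directly to see how quickly it decays. The short sketch below uses only the formula stated in the text; the function name `penalty` is hypothetical.

```python
import math


def penalty(tp):
    """Penalty weight mu of the domain loss at training progress tp in [0, 1]."""
    return 2.0 - 2.0 / (1.0 + math.exp(-10.0 * tp))


for tp in (0.0, 0.5, 1.0):
    print(tp, round(penalty(tp), 4))
# 0.0 1.0
# 0.5 0.0134
# 1.0 0.0001
```

The weight starts at exactly 1 and has already dropped to about 0.013 at the midpoint of training, so under this schedule the domain-discriminative term mainly influences the early epochs.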
Algorithm 1, which includes a transductive training procedure and a few-shot fault diagnosis process based on PRS, is given as follows.
4. For episode = 1 to $n_e$
5. Sample few-shot tasks $T_S$ from $S_n^S$ and $Q_n^S$, then feed them into $f_{FE}$ and $f_{LP}$
6. Calculate the few-shot classification loss $L_C$
7. Sample few-shot tasks $T_T$ from $S_n^T$ and $Q_n^T$, and feed them into $f_{FE}$
8. Input the representation vectors of the source and target domains to $f_D$
9. Compute the training progress tp = epoch/$n_E$
10. Calculate the domain-discriminative loss $L_D$
11. Backpropagate with µ for $L_D$
12. Optimize the parameters $\theta_{FE}$ and $\theta_D$

Few-shot fault diagnosis based on PRS
Input: a target labeled support set $S^T$, a target unlabeled query set $Q^T$, the number of episodes $n_e$, the number $Z$ of pseudo-labeled query samples per category, and the trained feature extractor $f_{FE}$
Output: Prediction results and average accuracy
1. For episode = 1 to $n_e$
2. Sample few-shot tasks $T_T$ from $S_n^T$ and $Q_n^T$, and feed them into $f_{FE}$
3. Recalibrate the naïve prototypes by PRS: $P_n^r \leftarrow \mathrm{PRS}(Q_n^T, P_n, Z)$
4. Obtain the health conditions of $Q_n^T$ by the refined $f_{LP}$ with $P_n^r$
5. End

Bearing Dataset
To verify the effectiveness and superiority of the proposed DAPRN, the raw vibration signals of bearings collected by the Bearing Data Center of Case Western Reserve University (CWRU) are used [39]. As illustrated in Figure 5, the CWRU bearing test rig is composed of three main parts: the induction motor, the torque transducer, and the load motor. Specifically, an accelerometer placed at the 12 o'clock position of the drive-end bearing of the induction motor is used to acquire the signals with a sampling frequency of 12 kHz. During experiments, three types of common bearing faults, namely outer race fault, inner race fault, and ball fault, are simulated by electrical discharge machining. According to the size of the artificial damage, three levels of faults (i.e., 0.007 inches, 0.014 inches, and 0.021 inches) were seeded on the test bearings. Hence, there are ten health conditions or categories in the CWRU bearing dataset, including one normal health condition and nine types of faults. In addition, four different working loads (i.e., 0 hp, 1 hp, 2 hp, and 3 hp) are provided by the load motor to simulate various working conditions in real industrial scenarios. In our few-shot fault diagnosis, each category contains 200 samples and each sample is a raw vibration signal with 1024 data points. The data samples of both normal and faulty conditions are displayed in Figure 6. Further experimental details of the bearing dataset setting are summarized in Table 2.

Gearbox Dataset
Considering that gearboxes play a role as important as bearings in modern manufacturing, another industrial-use dataset collected from a two-stage planetary gearbox is utilized to further verify the proposed method on few-shot fault diagnosis tasks. As demonstrated in Figure 7, this gearbox testbed is composed of a variable-speed servo driving motor, a torque transducer, a two-stage planetary gearbox, an assistant planetary gearbox, and a brake. The accelerometer is placed in the horizontal direction on the second-stage surface of the test gearbox housing to continuously acquire the external vibration signal with a sampling rate of 25.6 kHz. Before the experiments, three types of faults with a fault diameter of 1 mm are simulated on the second stage of the test gearbox; these faults, including sun gear fault (SF), planetary gear fault (PF), and ring gear fault (RF), are displayed in Figure 8. In this gearbox test rig, each experiment was carried out under three different working conditions. Concretely, the working loads varied between 50 Nm, 150 Nm, and 150 Nm, with a constant rotating speed of 1500 rpm. Similar to the above bearing dataset, each category of the gearbox dataset consists of 200 samples with 1024 data points. The settings of this dataset are outlined in Table 3.

Implementation Details
As indispensable components of mechanical equipment in modern industrial production, bearings and planetary gearboxes are prone to damage during operation. To achieve few-shot fault diagnosis of these two key components, the detailed network structures of the feature extractor and domain discriminator of the proposed DAPRN are shown in Tables 4 and 5, respectively. Note that #Param denotes the number of trainable parameters of the corresponding layer. As demonstrated in Table 4, the feature extractor is composed of one input layer, four 1D convolution-max-pooling blocks (ConvB), and one flattened output layer. Batch normalization (BN) [40], which is used to reduce the internal covariate shift and speed up the network training, is embedded into the proposed network. Furthermore, the rectified linear unit (ReLU), one of the most commonly used activation functions, is applied to introduce nonlinear transformation into the model. Note that Conv1D, BN-ReLU, and Max-Pool1D denote a 1D convolutional layer, a combination of the BN and ReLU, and a 1D max-pooling layer, respectively. In the feature extractor, the four stacked ConvBs obtain feature representations from the original vibration signals layer by layer. For instance, in ConvB1, raw vibration signals whose size is [1 × 1024] are fed into the proposed network. Concretely, the raw vibration signals are input into the Conv1D, whose kernel size is 16 × 1 with a stride of 2, followed by the BN layer and ReLU, and the output whose size is [8 × 252] is obtained by the Max-Pool1D with a kernel size of 2 and a stride of 2. In addition, the domain discriminator consists of two FC layers and a ReLU, as shown in Table 5.
Concerning the implementation details of model training, the Adam optimizer [41] with the backpropagation algorithm is adopted to update all trainable parameters of the proposed DAPRN model.
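As a quick check of the ConvB1 sizes quoted in this section, the output length of a 1D convolution or pooling layer can be computed from the input length, kernel size, and stride. The sketch assumes no padding and no dilation, which is consistent with the reported output length of 252 but is not stated explicitly in the text; the helper name is hypothetical.

```python
def conv1d_out_len(length, kernel, stride, padding=0):
    """Output length of a 1D convolution or pooling layer (no dilation)."""
    return (length + 2 * padding - kernel) // stride + 1


# ConvB1: Conv1D with kernel 16 and stride 2, then max-pooling with kernel 2 and stride 2.
after_conv = conv1d_out_len(1024, kernel=16, stride=2)
after_pool = conv1d_out_len(after_conv, kernel=2, stride=2)
print(after_conv, after_pool)  # 505 252
```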
During the training procedure, the number of epochs is set to 50. A decaying learning rate strategy is implemented with an initial learning rate of 0.001 and a decay rate of 0.1, and the learning rate is decayed at epochs 15 and 30. For the training and testing of the DAPRN model, four shot settings commonly used in fault diagnosis [42,43], i.e., 1, 3, 5, and 10 shots, are used to evaluate the performance of few-shot fault identification; the number of query samples is 200, and the number of episodes (also known as tasks) sampled for training and testing per epoch is set to 100. In the PRS of the meta-testing procedure, the value of Z is set to 20. For clarity, all detailed parameters of the transductive training procedure and meta-testing procedure are summarized in Table 6. To illustrate the efficacy and advantage of the proposed DAPRN, seven popular few-shot learning methods are compared. Notably, for a fair comparison, the feature extractors of all compared methods are built in the same way as that of the proposed DAPRN. The details of the compared methods are as follows:

1. Baseline [25]: The Baseline model is trained in two stages: pre-training on the base set and fine-tuning on the support set. In the pre-training stage, the Baseline model is composed of a feature extractor and a base class classifier. During the fine-tuning stage, the base class classifier is replaced by a novel class classifier composed of stacked FC layers.

2. BaselinePlus [25]: The BaselinePlus model is the same as the Baseline model except for the novel class classifier. In detail, the classifier, which explicitly reduces intraclass variations by using a cosine-similarity classification structure, is designed to recognize the health conditions of machinery.

3. SiameseNet [44]: The SiameseNet model, which is composed of a feature extractor and a similarity-measurement module based on a deep neural network, is trained on sample pairs of the same or different health conditions of machinery. Note that the sample pairs are randomly selected during the training process.

4. MAML [43]: The MAML model, which is agnostic to both the structure of the feature extractor and the loss function, is a bilevel learning paradigm (i.e., inner-loop optimization and outer-loop optimization) for meta-knowledge transfer. In detail, the parameters of MAML are quickly updated by inner-loop and outer-loop optimization.

5. ProtoNet [27]: The ProtoNet model, which includes a feature extractor and a Euclidean distance-based label predictor, identifies the health conditions of machinery by using the naïve prototypes of the source domain.

6. MatchingNet [45]: The MatchingNet model, in which the representations of support and query samples are obtained by two independent feature extractors embedded with LSTM, recognizes the health conditions by an attention-based label predictor.

7. RelationNet [46]: The RelationNet model is built on a feature extractor without the last two max-pooling layers and a deep network for metric learning. Specifically, a two-layer CNN is trained to learn a metric space for few-shot fault diagnosis tasks.

The Baseline and BaselinePlus methods, which are both based on a traditional supervised learning paradigm, are trained with the original data and tested on the query set. The remaining few-shot methods follow the standard episodic training paradigm. Consequently, SiameseNet, MAML, ProtoNet, MatchingNet, RelationNet, and DAPRN report the average accuracy over all episodes of the meta-testing procedure, whereas Baseline and BaselinePlus report the accuracy of the testing procedure.
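The decaying learning rate strategy described in this section (initial rate 0.001, decay rate 0.1, decay epochs 15 and 30, 50 epochs in total) can be written as a small step function. This is a minimal sketch, assuming 0-indexed epochs and that each decay takes effect from the listed epoch onward; the paper does not state these conventions, and the function name is hypothetical.

```python
def step_decay_lr(epoch, initial_lr=1e-3, decay_rate=0.1, milestones=(15, 30)):
    """Learning rate for a 0-indexed epoch under multi-step decay.

    Assumption: the rate is multiplied by decay_rate once for every
    milestone that has already been reached.
    """
    n_decays = sum(1 for m in milestones if epoch >= m)
    return initial_lr * decay_rate ** n_decays


# Learning rate for each of the 50 training epochs
schedule = [step_decay_lr(epoch) for epoch in range(50)]
print(schedule[0], schedule[15], schedule[30])
```

With these assumptions the rate is 0.001 for epochs 0 to 14, 0.0001 for epochs 15 to 29, and 0.00001 afterwards. This is comparable to what multi-step schedulers in common deep learning frameworks provide, for example PyTorch's `MultiStepLR` with `milestones=[15, 30]` and `gamma=0.1`.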

Situation A: Transfer Learning Scenarios with Limited Data
Transfer learning fault diagnosis scenarios with limited data are widespread in real industrial applications. Based on the few-shot setup, the one-shot fault diagnosis problem across different working conditions is addressed and analyzed by few-shot learning methods. As shown in Table 7, six transfer learning tasks (i.e., AB1, AB2, AB3, AG1, AG2, and AG3) on the bearing and gearbox datasets are set to validate the effectiveness and superiority of the proposed DAPRN. The tasks AB1, AB2, and AB3 are 10-way one-shot fault identification tasks for bearings, whereas the tasks AG1, AG2, and AG3 are four-way one-shot fault identification tasks for the planetary gearbox. Notably, each of the six tasks is a closed-set transfer learning problem with scarce data, in which the target and source domains share the same label space. The results of the six one-shot fault diagnosis problems are summarized in Table 8, where the bold value denotes the maximum accuracy on each task. Firstly, it can be seen that the proposed DAPRN achieves the highest one-shot fault diagnosis accuracy on all bearing and gearbox tasks, with 89.94% accuracy on task AB2 and 99.98% accuracy on task AG1. Secondly, the BaselinePlus method outperforms the Baseline method on all tasks, which reveals that the cosine similarity-based classifier is more effective than a network-based classifier in transfer learning scenarios with limited data. Thirdly, it can be observed that all few-shot methods except ProtoNet perform better on gearbox tasks AG1, AG2, and AG3 than on bearing tasks AB1, AB2, and AB3, which indicates that the four-way one-shot problem is easier than the 10-way one-shot problem. A possible explanation for this phenomenon is the differing difficulty of matching support and query samples in the metric space.
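To make the N-way K-shot terminology concrete, the following sketch draws a single 10-way one-shot episode from a toy dataset. It is an illustration only: the sampler, the toy sample identifiers, and the choice of 20 query samples per class are assumptions and do not reproduce the paper's data pipeline.

```python
import random


def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode.

    Returns (support, query), each a list of (sample, label) pairs, with
    k_shot support samples and n_query query samples per selected class.
    """
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so support and query never overlap
        picked = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Toy stand-in for a dataset with 10 health conditions and 200 samples each
toy_data = {c: [f"class{c}_sample{i}" for i in range(200)] for c in range(10)}
rng = random.Random(0)
support, query = sample_episode(toy_data, n_way=10, k_shot=1, n_query=20, rng=rng)
print(len(support), len(query))
```

In a one-shot episode the label predictor sees exactly one labeled support sample per health condition, which is why a 10-way task leaves more room for confusion between classes than a four-way task.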

Situation B: Cross-Domain Few-Shot Fault Diagnosis
To further verify the validity of DAPRN, another six cross-domain few-shot fault diagnosis tasks on the bearing and gearbox datasets are set in this part, as shown in Table 9. In tasks BB1, BB2, BB3, and BG1, the setting of few-shot fault diagnosis is the same as that of the FSL scenarios in computer vision, which means that the label spaces of the source and target domains are completely disjoint from each other. In addition, another two tasks, BG2 and BG3, are carried out to imitate open-set fault diagnosis scenarios with scarce data. Experimental results of all few-shot fault diagnosis methods are summarized in Table 10. As before, the bold values in this table denote the best fault diagnosis performance on each task, and some interesting observations can be made. Firstly, compared with the other methods, the proposed DAPRN with transductive domain adaptation and PRS achieves the best few-shot fault diagnosis performance on all tasks. For instance, the proposed method achieves 100.00% accuracy on bearing task BB3 and 98.65% accuracy on gearbox task BG1, which are higher than those of the other few-shot approaches by a large margin. Secondly, it can be observed that the dependency between the source and target tasks in cross-domain few-shot settings profoundly affects the performance of all few-shot fault diagnosis methods. In particular, although tasks BB1 and BB2 are both three-way one-shot fault diagnosis tasks, most few-shot methods achieve dramatically higher performance on task BB2 than on BB1. These results may be explained by the fact that the model must identify three outer race faults (i.e., "O1", "O2", and "O3") on task BB1, whereas the base set does not include similar fault types.
Thirdly, it is somewhat surprising that the proposed DAPRN achieves a higher accuracy on task BG1 than on task BG2, while most few-shot approaches perform significantly better on the open-set fault diagnosis task BG2 than on the standard few-shot task BG1. There are several possible causes for this difference; it may be related to the different task settings of open-set and few-shot scenarios. More experiments with 3-, 5-, and 10-shot settings on bearing task BB1 and gearbox task BG1 are carried out for further analysis; the results are summarized in Figure 9, and some interesting findings can be obtained. What stands out in the figure is that the diagnosis accuracy of most compared methods increases with the number of shots on tasks BB1 and BG1. In contrast, no significant differences in the few-shot fault diagnosis performance of the proposed DAPRN are found between the various shot settings on these tasks. Furthermore, SiameseNet, MAML, ProtoNet, and DAPRN accomplish the fault diagnosis of task BB1 with the 10-shot setting at the same level of accuracy. Meanwhile, on 5-shot and 10-shot BG1, the SiameseNet method achieves accuracy similar to that of the proposed DAPRN. In conclusion, the results in this section indicate that the proposed DAPRN performs well for few-shot fault diagnosis with different numbers of shots owing to the transductive domain adaptation and the PRS-based meta-testing process.

The Structure of Feature Extractor
In a few-shot fault diagnosis problem, the representation vectors generated by the feature extractor have a great impact on fault classification performance. To illustrate the relationship between the structure of the feature network and few-shot fault diagnosis, ten trials of the DAPRN with different feature extractors are repeated on two few-shot fault diagnosis tasks, BB1 and BG1. As shown in Table 4, the feature extractor is composed of four ConvBs. The effect of the depth of the feature extractor on fault diagnosis for tasks BB1 and BG1 is shown in the box plots in Figure 10, and the statistical characteristics of the ten experiments are described using the 95% confidence interval (CI) and standard deviation (SD). More quantitative results and the numbers of trainable parameters are summarized in Table 11. In Figure 10a, the fault identification accuracy improves and the SD decreases as the depth of the feature extractor increases on the three-way one-shot task BB1, which suggests that the proposed DAPRN tends to be more effective with more trainable parameters in the feature extractor. In contrast, the few-shot diagnostic results on the two-way one-shot task BG1 are somewhat counterintuitive, as shown in Figure 10b and Table 11. Surprisingly, the SD on this task is 0.94% when the depth of the feature extractor is 1, which could be attributed to the fact that the proposed DAPRN with a trained one-layer feature extractor misclassifies most of the query samples of the two fault types (i.e., "PF" and "RF") as one category. That is, the one-layer feature extractor with 152 learnable parameters is incapable of handling few-shot fault diagnosis problems. In summary, the success of the proposed DAPRN could be related to its powerful feature extraction ability.

The NoDA models perform poorly compared to the NoPR models, which shows that the domain discriminator plays a more vital role in few-shot fault diagnosis problems than the prototype recalibration.
For further analysis, the results of one trial for the three models are detailed in Figure 12. As shown in Figure 12a, the diagnostic accuracy of all models converges to a certain degree, and the DAPRN achieves the best accuracy when the training process is over. Furthermore, it can be observed that the diagnostic performance of DAPRN is lower than that of NoPR in the early stages of the training process and higher than that of NoPR in the late stages, implying that the diagnostic performance is only enhanced when the trained label predictor is sufficiently effective. The detailed diagnostic results of one episode at the beginning and end of the transductive training process are shown in Figure 12b, and the kernel density estimation method is applied to fit the probability distribution of the detailed results. As can be seen from Figure 12b, the trained DAPRN is capable of outperforming the other models even though its diagnostic accuracy in the initial stage is worse than theirs. In addition to the quantitative analysis, t-distributed stochastic neighbor embedding (t-SNE) [38] is employed for an intuitive qualitative analysis in Figure 13. Specifically, the outputs of the feature extractor (i.e., representation vectors) of the NoDA and DAPRN are each mapped into a two-dimensional space for clear visualization. In this figure, the DAPRN gathers the representations of each health condition (i.e., I2, I3, O2, and O3) better than the NoDA, which means that the label predictor of DAPRN has more potential for few-shot fault diagnosis via transductive domain adaptation.

Figure 11. Detailed testing accuracy of ten trials using various models on task BB3.

In practical engineering, the hyperparameter Z of the PRS is expected to affect the few-shot fault diagnosis performance of the proposed DAPRN.
For further analysis, the sensitivity experiment on task BB3 was carried out ten times, and the box plot in Figure 14 shows the few-shot fault diagnosis performance for different values of the hyperparameter Z of the PRS. In detail, it can be found that the performance of the proposed DAPRN with Z being 10 is lower than that with the other values on task BB3. Moreover, the performance of the DAPRN nearly reaches a plateau at around 94% accuracy when the Z value of the PRS is more than 20. However, none of these differences in the few-shot fault diagnosis performance of DAPRN are statistically significant. This finding was unexpected and suggests that blindly increasing the number of query samples used for the PRS does not bring significant benefits for few-shot fault diagnosis tasks. A possible explanation for this might be that the recalibrated prototypes for each health condition are limited by the representations of the feature extractor. In other words, as the Z value increases, the enhancement gained by the PRS cannot continue to grow because feature aggregation in the representation space is imperfect.
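The role of Z can be illustrated with a toy prototype recalibration routine. The exact PRS rule is defined elsewhere in the paper and is not restated in this section, so the sketch below is only one plausible reading: it assumes that each query embedding is pseudo-labeled by its most cosine-similar naïve prototype, and that up to Z of the most similar query embeddings per class are averaged together with the support embeddings. All function names and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def recalibrate_prototypes(support, query, z):
    """support: {label: [embedding, ...]}, query: [embedding, ...].

    Assumed rule (not taken from the paper): refine each naive prototype
    with up to z pseudo-labeled query embeddings of highest similarity.
    """
    naive = {c: mean_vector(vs) for c, vs in support.items()}
    assigned = {c: [] for c in naive}
    for q in query:
        best = max(naive, key=lambda c: cosine(q, naive[c]))
        assigned[best].append((cosine(q, naive[best]), q))
    recalibrated = {}
    for c, scored in assigned.items():
        scored.sort(key=lambda pair: pair[0], reverse=True)
        top = [q for _, q in scored[:z]]
        recalibrated[c] = mean_vector(support[c] + top)
    return recalibrated


def predict(prototypes, q):
    """Cosine similarity-based label prediction for one query embedding."""
    return max(prototypes, key=lambda c: cosine(q, prototypes[c]))


# Toy 2-way one-shot episode with 2-D embeddings (made-up values)
support = {"I": [[1.0, 0.2]], "O": [[0.2, 1.0]]}
query = [[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]]
prototypes = recalibrate_prototypes(support, query, z=2)
print(predict(prototypes, [0.8, 0.3]))
```

In this reading, once the Z selected query embeddings already lie close to the class center, adding more of them moves the prototype very little, which is consistent with the plateau described above; how far the prototype can improve is bounded by how well the feature extractor separates the classes.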

Conclusions
To address the issue of few-shot fault diagnosis in real industrial scenarios, an innovative domain-adaptive prototype-recalibrated network (DAPRN) based on transductive domain adaptation and a prototype recalibration strategy (PRS) is proposed, comprising a feature extractor, a domain discriminator, and a label predictor. Compared with popular few-shot fault diagnosis methods, the DAPRN jointly conducts source few-shot task classification and domain adaptation across domains, so that the meta-knowledge of source diagnostic tasks is effectively generalized to the target tasks through domain adaptation in the transductive training stage. To reduce the bias of the naïve prototypes, the PRS is implemented to improve few-shot fault diagnosis in the meta-testing stage. Extensive experiments on bearing and gearbox datasets demonstrate outstanding fault diagnosis performance under various limited data conditions. The quantitative and qualitative analyses indicate that the proposed method outperforms other few-shot fault diagnosis methods. In terms of real industrial application, the proposed method shows promise for addressing more challenging fault diagnosis scenarios with limited data and for facilitating practical fault identification of machinery.
The most important limitation of this study is that the proposed method can only perform offline predictions. In addition, a high-quality annotated dataset from a related domain is required. In the future, both meta-knowledge generalization across different machines and few-shot fault diagnosis with an insufficient base dataset should be investigated to solve more fault identification tasks in real industrial scenarios.