Partial Transfer Ensemble Learning Framework: A Method for Intelligent Diagnosis of Rotating Machinery Based on an Incomplete Source Domain

Most cross-domain intelligent diagnosis approaches presume that the health states in training datasets are consistent with those in testing. However, it is usually difficult and expensive to collect samples under all failure states during the training stage in actual engineering; this causes the training dataset to be incomplete. These existing methods may not be favorably implemented with an incomplete training dataset. To address this problem, a novel deep-learning-based model called partial transfer ensemble learning framework (PT-ELF) is proposed in this paper. The major procedures of this study consist of three steps. First, the missing health states in the training dataset are supplemented by another dataset. Second, since the training dataset is drawn from two different distributions, a partial transfer mechanism is explored to train a weak global classifier and two partial domain adaptation classifiers. Third, a particular ensemble strategy combines these classifiers with different classification ranges and capabilities to obtain the final diagnosis result. Two case studies are used to validate our method. Results indicate that our method can provide robust diagnosis results based on an incomplete source domain under variable working conditions.


Introduction
Rotating components play a significant role in system performance and are widely applied in engineering machinery such as aircraft, engine, and gearbox systems [1,2]. The failure of rotating components may cause unexpected downtime and economic losses. Therefore, it is crucial to precisely detect and identify the fault states of rotating machinery [3]. Recently, intelligent fault diagnosis has become a hotspot because it can analyze vast amounts of measured data and provide intuitive diagnosis results [4].
Intelligent fault diagnosis has received a lot of attention in recent years from both industrial engineers and academic researchers and has produced remarkable achievements [5]. For example, shallow machine learning techniques such as support vector machine (SVM) [6] and random forest (RF) [7] have been studied. Deep learning methods that can adaptively extract the fault features hidden in a collected signal have also been researched, such as recurrent neural network (RNN) [8], convolutional neural network (CNN) [9], and stacked autoencoder (SAE) [10]. In addition, some variant models are being studied, such as dilated CNN [11], CNN with capsule network [12], and multiscale CNN [13]. However, the existing methods are built on statistical learning, which assumes that adequate labeled samples are available to train the models. In addition, these methods require the data distributions of training and testing to be identical [14]. In actual industry settings, obtaining a large amount of labeled data is unrealistic. Even if the labeled data can be acquired, the aforementioned methods may fail to recognize the unlabeled data collected

This research studies a partial transfer ensemble learning framework (PT-ELF) to solve the above problem. First, two incomplete source domain datasets collected from different components or under different working conditions are defined. Note that neither of them contains all the health states present in the target domain data. They are used to form a complete dataset in which all the health states are included. Then, a weak global classifier based on the complete dataset and two strong partial classifiers based on the deep adversarial network are established. Finally, since the classification ability and classification range of the classifiers differ, a particular ensemble strategy is designed to combine the two strong partial classifiers and the weak global classifier, resulting in the final diagnostic results. The main contributions of this research are summarized as follows: (1) A partial transfer ensemble learning framework is designed to diagnose faults with incomplete training datasets under various conditions; (2) To incorporate the classification ability of multiple classifiers into the PT-ELF model, a particular ensemble strategy is designed to combine a weak global classifier and two partial domain adaptation classifiers; (3) Two case studies using rotor bearing test bench data and motor bearing data are performed to validate and demonstrate the superiority of the proposed method.
The rest of this article is arranged as follows: Section 2 presents the basic theories. The details of the proposed PT-ELF are given in Section 3. Section 4 validates the proposed method and analyzes the results. Finally, the conclusion in Section 5 brings the study to a close.

Convolutional Neural Network
A standard CNN usually includes convolution, pooling, fully connected, and output layers. In addition, batch normalization is commonly used in CNNs [25]. A convolution layer is combined with a pooling layer to form a convolution block, and a deep architecture is built from several such blocks. A Softmax regression layer usually serves as the last layer and performs regression or classification [26]. In a convolutional layer, a local receptive field is adopted, in which only part of the input sample points connect to each node. This operation sharply decreases the number of parameters and the model complexity. To identify local features throughout the input sample, weights and biases are shared between the hidden neurons in one convolutional layer [27]. The process in the convolutional layer can be expressed as:
$$u_n^l = \sum_k w_n^l * x_k^{l-1} + b_n^l \tag{1}$$
where $x_k^{l-1}$ is the $k$-th node in layer $l-1$, $*$ represents the convolution operation, and $w_n^l$ and $b_n^l$ represent the weight and the corresponding bias. Additionally, the activation function $\phi(\cdot)$ is applied to transform the output of the convolution layer nonlinearly, which can be denoted as:
$$c_n^l = \phi(u_n^l) \tag{2}$$
where $c_n^l$ represents the $n$-th nonlinear feature value in layer $l$. Sigmoid and ReLU activation functions are commonly used in CNNs. Sigmoid maps the input to the range between 0 and 1. ReLU can improve the efficiency of model training and decrease the risk of vanishing gradients [28].
In a pooling layer, the down-sampling operation decreases the dimension of the features and enhances their robustness. Mathematically, a maximum pooling operation is defined as:
$$po_j = \max_{i \in R_j} c_i \tag{3}$$
where $c_i$ is the feature value at the $i$-th location, $R_j$ is the $j$-th pooling region, and $po_j$ is the output of the pooling. For classification tasks, after several convolution blocks and fully connected layers, the Softmax function is usually utilized to predict categories. The loss objective function can be expressed as:
$$L = -\sum_i r_i \log p_i \tag{4}$$
where $p$ represents the output probability, and $r$ corresponds to the actual labels.
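The operations described in this section can be illustrated with a minimal, dependency-free sketch. It is not the network used in this paper: the kernel, the input signal, and all function names are hypothetical, and the convolution is implemented as cross-correlation, as is common in deep learning libraries.

```python
import math

def conv1d(x, w, b):
    """Valid 1-D convolution (implemented as cross-correlation) with a shared kernel w and bias b."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) + b for i in range(len(x) - k + 1)]

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def max_pool(values, size):
    """Non-overlapping maximum pooling over windows of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def softmax(z):
    """Numerically stable Softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(p, r):
    """Cross-entropy between predicted probabilities p and one-hot labels r."""
    return -sum(ri * math.log(pi) for pi, ri in zip(p, r) if ri > 0)

# Toy input signal and a hypothetical difference kernel.
x = [0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]
features = max_pool(relu(conv1d(x, w=[1.0, 0.0, -1.0], b=0.0)), size=2)
probabilities = softmax(features)
loss = cross_entropy(probabilities, [0, 1, 0])
```

Because the kernel is shared across all positions, the layer above has only four trainable values (three weights and one bias) regardless of the input length, which is the parameter saving the text refers to.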

Deep Adversarial Convolutional Neural Network
Generally, a deep adversarial convolutional neural network (DACNN) consists of a feature extractor $G_f$, a domain discriminator $G_d$, and a classifier $G_y$ [29][30][31]. The feature extractor, namely several convolution blocks, serves as a contestant in the DACNN. It can be expressed as $G_f = G_f(x, \theta_f)$, which indicates that the features are extracted from the input sample $x$ with parameters $\theta_f$. In addition, a discriminator (binary classifier) is treated as the opponent, which is expressed as $G_d = G_d(G_f(x), \theta_d)$ with parameters $\theta_d$. The source and target samples are input into the feature extractor, and the output features are further distinguished by the discriminator $G_d$. The binary cross-entropy loss is taken as the objective function, which is described as:
$$L_d(G_d(G_f(x_i)), d_i) = -d_i \log G_d(G_f(x_i)) - (1 - d_i) \log\left(1 - G_d(G_f(x_i))\right) \tag{5}$$
where $d_i$ denotes the binary domain variable for $x_i$. Through the adversarial training between the two parts, the feature extractor $G_f$ tends to extract the features common to the two types of data, which makes it hard for the discriminator to decide whether a sample carries domain label 0 or 1. Hence, the model can perform well on both the source and target datasets. The loss function is expressed as:
$$E_d(\theta_f, \theta_d) = -\left(\frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{N-n}\sum_{i=n+1}^{N} L_d^i(\theta_f, \theta_d)\right) \tag{6}$$
where $L_d^i(\theta_f, \theta_d) = L_d(G_d(G_f(x_i)), d_i)$, and $n$ and $N-n$ represent the sample numbers of the source and target domains. Additionally, all of the labeled samples should be supervised during training to ensure the accuracy of the diagnosis in the adversarial procedure. Thus, a classifier is established and is expressed as $G_y = G_y(G_f(x), \theta_y): R^D \to R^L$ with parameters $\theta_y$, in which $L$ is the number of classes. The cross-entropy loss is applied to the Softmax output and is described as:
$$L_y(G_y(G_f(x_i)), y_i) = -\log G_y(G_f(x_i))_{y_i} \tag{7}$$
Adding Equation (7) to the objective function (6), the optimization objective can be expressed as:
$$E(\theta_f, \theta_y, \theta_d) = \frac{1}{n}\sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) - \lambda\left(\frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{N-n}\sum_{i=n+1}^{N} L_d^i(\theta_f, \theta_d)\right) \tag{8}$$
where $L_y^i(\theta_f, \theta_y) = L_y(G_y(G_f(x_i)), y_i)$ and $\lambda$ is a non-negative hyper-parameter that trades off the classification loss against the losses of the discriminator.
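How the classification loss and the discriminator loss are combined can be sketched numerically. The snippet below assumes the standard domain-adversarial form of the objective (classification loss minus $\lambda$ times the averaged domain losses) and assumes that source samples carry domain label 0 and target samples domain label 1; the function names and the example numbers are hypothetical, and no network is trained here.

```python
import math

def domain_bce(p, d):
    """Binary cross-entropy for a discriminator output p in (0, 1) and domain label d in {0, 1}."""
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))

def dacnn_objective(label_losses, source_domain_probs, target_domain_probs, lam):
    """Classification loss minus lam times the averaged domain losses.

    label_losses        : per-sample classification losses on the n labeled source samples
    source_domain_probs : discriminator outputs for the n source samples (assumed label 0)
    target_domain_probs : discriminator outputs for the N - n target samples (assumed label 1)
    """
    n = len(source_domain_probs)
    m = len(target_domain_probs)
    classification = sum(label_losses) / n
    domain = (sum(domain_bce(p, 0) for p in source_domain_probs) / n
              + sum(domain_bce(p, 1) for p in target_domain_probs) / m)
    return classification - lam * domain

# A fully confused discriminator outputs 0.5 for every sample.
value = dacnn_objective([0.2, 0.4], [0.5, 0.5], [0.5, 0.5, 0.5], lam=0.1)
```

The feature extractor and classifier lower this value while the discriminator raises it, so a confused discriminator (outputs near 0.5) corresponds to features that carry little domain information.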
In the whole training procedure of the DACNN, the optimal parameters $\hat{\theta}_f$, $\hat{\theta}_y$, and $\hat{\theta}_d$ can be obtained by:
$$(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d) \tag{9}$$
$$\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d) \tag{10}$$
The flowchart of the DACNN is displayed in Figure 2. By optimizing Equations (9) and (10), the DACNN tends to train a feature extractor $G_f$ whose representations can be classified accurately by the classifier $G_y$ while weakening the ability of the discriminator $G_d$ to tell which domain a representation comes from. In the testing phase, the domain-insensitive features are extracted by the feature extractor $G_f$ and fed into the health state classifier $G_y$ to identify the states directly.

The Proposed Method
This section describes the proposed method in detail. It mainly includes problem formulation, the training of the three classifiers, and the classifiers' ensemble.

Problem Formulation
Before implementing the proposed method, two incomplete source domain datasets A and B are defined as shown in Figure 3. The source dataset $A = \{(x_i^A, y_i^A)\}_{i=1}^{n_A}$ consists of $n_A$ labeled instances associated with $|D_A|$ classes and is drawn from distribution $P_{SA}$. The source dataset $B = \{(x_i^B, y_i^B)\}_{i=1}^{n_B}$ consists of $n_B$ labeled instances associated with $|D_B|$ classes, is collected from another component of the same type, and is drawn from distribution $P_{SB}$. The class label spaces of A and B are denoted as $D_A$ and $D_B$, respectively. Collecting data from different components results in variations in the operating conditions (such as load and speed) in a real industrial environment; this means that $P_{SA} \neq P_{SB}$. In addition, there must be some shared health states contained in both source dataset A and source dataset B, which are denoted as $D = D_A \cap D_B$ and shown in Figure 3. $\hat{D}_A = D_A \setminus D_B$ denotes the private label space of A, and $\hat{D}_B = D_B \setminus D_A$ denotes the private label space of B.
However, in the testing stage of an actual machine fault diagnosis scenario, all possible health states may appear. Therefore, the target domain dataset includes all health states; it can be expressed as $T = \{x_i^T\}_{i=1}^{n_T}$, consisting of $n_T$ unlabeled instances associated with $|D_T|$ classes drawn from distribution $P_T$. $D_T$ represents the label set of the target domain, and $D_T = D_A \cup D_B$. In addition, the target domain distribution $P_T$ differs from the source domain distributions $P_{SA}$ and $P_{SB}$. This paper aims to establish a fault diagnosis model that realizes fault diagnosis based on incomplete source training data under different operating conditions.
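The label spaces defined above map directly onto set operations. The short sketch below uses the health-state indices of the rotor case study (states 1-5 in source A, states 4-7 in source B); the variable names are chosen for illustration only.

```python
# Label spaces of the two incomplete source domains (rotor case study indices).
D_A = {1, 2, 3, 4, 5}
D_B = {4, 5, 6, 7}

shared = D_A & D_B        # D: health states present in both sources
private_A = D_A - D_B     # private label space of A
private_B = D_B - D_A     # private label space of B
D_T = D_A | D_B           # target label space: all health states

# Neither source alone covers the target label space.
covers_target = (D_A == D_T, D_B == D_T)
```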

Classifier Training
This section describes the training procedure for the three classifiers (weak classifier $C_W$, classifier $C_A$, and classifier $C_B$) concretely.
First, a complete dataset C that contains all of the classes can be formed based on the incomplete source datasets A and B, as shown in Figure 4. In the complete dataset C, the samples in label space $\hat{D}_A$ are from source dataset A, and the samples in label space $\hat{D}_B$ are from source dataset B. For the samples in the shared label space $D$, a portion of them come from A, and the rest come from B. Thus, the label space of dataset C is the same as that of T, and it includes $|D_T|$ health states. Second, a standard CNN classifier $C_W$ is trained using the complete dataset C. However, since the source domain datasets A and B are collected under various working conditions, the samples in dataset C are drawn from two distributions. In addition, the data distribution of the testing set, $P_T$, differs from both $P_{SA}$ and $P_{SB}$. Therefore, the classifier $C_W$ has poor classification ability for the target domain data, although it is able to classify all health states.
After the weak classifier $C_W$ is obtained, the test samples from the target domain $T = \{x_i^T\}_{i=1}^{n_T}$ of $n_T$ unlabeled instances associated with $|D_T|$ classes are classified, and the results serve as pseudo-labels in the subsequent training. Target domain samples whose pseudo-label is in $D_A$ are used to construct the target domain training set $A_T$. The samples whose pseudo-label is in $D_B$ are used to construct the target domain training set $B_T$. Thus, the datasets A and $A_T$ have the same label space $D_A$, and the datasets B and $B_T$ have the same label space $D_B$.
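The two data-preparation steps above, forming the complete dataset C and splitting the target samples by pseudo-label, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the weak classifier is passed in as an arbitrary callable, and the 50/50 split of shared-class samples between A and B is an assumption, since the text only states that a portion comes from each source.

```python
def build_complete_dataset(A, B, share_from_A=0.5):
    """Merge two incomplete lists of (sample, label) pairs into a complete dataset C."""
    shared = {y for _, y in A} & {y for _, y in B}
    # Private classes are copied in full from the only source that contains them.
    C = [(x, y) for x, y in A + B if y not in shared]
    for label in sorted(shared):
        from_A = [(x, y) for x, y in A if y == label]
        from_B = [(x, y) for x, y in B if y == label]
        k_A = int(len(from_A) * share_from_A)
        k_B = int(len(from_B) * share_from_A)
        # A portion of each shared class comes from A, the rest from B (assumed ratio).
        C += from_A[:k_A] + from_B[k_B:]
    return C

def split_by_pseudo_label(target_samples, weak_classifier, D_A, D_B):
    """Build the target training sets A_T and B_T from the weak classifier's pseudo-labels."""
    A_T, B_T = [], []
    for x in target_samples:
        pseudo = weak_classifier(x)
        if pseudo in D_A:
            A_T.append(x)
        if pseudo in D_B:
            B_T.append(x)
    return A_T, B_T
```

Note that a target sample whose pseudo-label falls in the shared label space is placed in both $A_T$ and $B_T$, which follows from the membership conditions stated in the text.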
Datasets A and $A_T$ have the same health states but are drawn from different distributions. Therefore, a DACNN model can be trained using the datasets A and $A_T$. The feature extractor and the classifier in this DACNN are combined to form a block, which is taken as classifier $C_A$. The classifier $C_A$ is constructed by a DACNN using domain adaptation techniques, so it has a strong classification ability for the unlabeled target domain dataset. However, the classification range of the strong classifier $C_A$ is limited to $|D_A|$ classes. After the training of classifier $C_A$ is completed, classifier $C_B$ is trained in the same way. Similarly, the classification range of $C_B$ is limited to $|D_B|$ classes.
In the implementation of the DACNN, the SELU activation function is used in the convolutional layers; its mathematical expression is given in Equation (11):
$$\mathrm{SELU}(x) = \lambda \begin{cases} x, & x > 0 \\ \alpha (e^{x} - 1), & x \le 0 \end{cases} \tag{11}$$
where the value of $\alpha$ is 1.6732, and the value of $\lambda$ is 1.0507. The SELU activation function can automatically normalize the sample distribution toward zero mean and unit variance, which helps avoid exploding or vanishing gradients. The activation function used in the fully connected layers of the state classifier and the domain discriminator is ReLU, and it is expressed as Equation (12):
$$\mathrm{ReLU}(x) = \max(0, x) \tag{12}$$
In this way, three well-trained classifiers are obtained, including one weak global classifier $C_W$, one strong partial classifier $C_A$, and one strong partial classifier $C_B$. The details of the three classifiers are listed in Table 1.

Table 1. Classification range and ability of the three classifiers.
Classifiers | Range of Classification | Ability of Classification
$C_W$ | $\lvert D_T \rvert$ classes | Weak
$C_A$ | $\lvert D_A \rvert$ classes | Strong
$C_B$ | $\lvert D_B \rvert$ classes | Strong
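The two activation functions in Equations (11) and (12) can be written out directly. The sketch below is a plain scalar implementation using the constants quoted in the text; it is given for illustration and is not taken from the authors' code.

```python
import math

ALPHA = 1.6732   # SELU alpha, as quoted in the text
LAMBDA = 1.0507  # SELU lambda (scale), as quoted in the text

def selu(x):
    """Scaled exponential linear unit for a scalar input."""
    if x > 0:
        return LAMBDA * x
    return LAMBDA * ALPHA * (math.exp(x) - 1.0)

def relu(x):
    """Rectified linear unit for a scalar input."""
    return max(0.0, x)

# For large negative inputs SELU saturates at -LAMBDA * ALPHA instead of 0.
saturation = selu(-50.0)
```

Unlike ReLU, SELU keeps a non-zero output and gradient for negative inputs, which is what allows activations to be pushed back toward zero mean.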

Classifiers' Ensemble
After the three classifiers are obtained, this section designs a particular ensemble strategy to combine their results. The procedure for the ensemble strategy is presented in Figure 5.
After inputting a testing sample $x$ into the three classifiers, the classification results $y_W$, $y_A$, and $y_B$ are output, which can be expressed as:
$$y_W = C_W(x), \quad y_A = C_A(x), \quad y_B = C_B(x) \tag{13}$$
If $y_W = y_A$, $y_W = y_B$, or $y_A = y_B$ is satisfied, the final result $y$ can be obtained by a majority voting strategy immediately. Otherwise, the results of the three classifiers all differ from each other. In such cases, because the classifier $C_W$ is a global classifier, $y_W$ serves as the reference standard. If $y_W \in D_A$ is satisfied, the actual label of $x$ may be in $D_A$. In this range, the classifier $C_A$ has strong classification ability, and thus $y_A$ serves as the final result. Similarly, if $y_W \in D_B$ is satisfied, $y_B$ serves as the final result. However, if $y_W \in D$ is satisfied, both the classifiers $C_A$ and $C_B$ have good classification ability in this shared range. In this case, $y$ is determined according to the output probability $p$ in the Softmax layer of the classifiers, and it can be expressed as:
$$y = \begin{cases} y_A, & p_A = \max(p_A, p_B, p_W) \\ y_B, & p_B = \max(p_A, p_B, p_W) \\ y_W, & p_W = \max(p_A, p_B, p_W) \end{cases} \tag{14}$$
where $p_A$, $p_B$, and $p_W$ represent the Softmax output probabilities of classifiers $C_A$, $C_B$, and $C_W$, and $\max(\cdot)$ is the maximum function.
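The decision rules of the ensemble strategy can be sketched as a single function. This is one possible reading of the rules, not the authors' implementation: it assumes that $p_A$, $p_B$, and $p_W$ are the highest Softmax probabilities of the respective classifiers, that the shared label space is checked before the two source label spaces, and that Equation (14) selects the label of the most confident classifier. The function name and arguments are hypothetical.

```python
def ensemble(y_W, y_A, y_B, p_W, p_A, p_B, D_A, D_B):
    """Fuse the outputs of the weak global classifier and the two strong partial classifiers."""
    # Majority voting: at least two classifiers agree.
    if y_W == y_A or y_W == y_B:
        return y_W
    if y_A == y_B:
        return y_A
    # All three results differ: use y_W as the reference standard.
    shared = D_A & D_B
    if y_W in shared:
        # Shared range: take the label of the most confident classifier (assumed form of Eq. 14).
        return max((p_A, y_A), (p_B, y_B), (p_W, y_W), key=lambda pair: pair[0])[1]
    if y_W in D_A:
        return y_A
    if y_W in D_B:
        return y_B
    return y_W

D_A, D_B = {1, 2, 3, 4, 5}, {4, 5, 6, 7}
example = ensemble(y_W=1, y_A=2, y_B=6, p_W=0.5, p_A=0.9, p_B=0.8, D_A=D_A, D_B=D_B)
```

In the example, the three predictions disagree and the weak classifier points to a state that only source A contains, so the prediction of $C_A$ is returned.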

Architecture of the Proposed Method
The architecture of our method for fault diagnosis is presented in Figure 6, and the process is summarized below.
(1) Collect original vibration signals from different components or under different working conditions, and convert them into frequency domain signals for subsequent model training;
(2) Construct a complete dataset by combining these incomplete datasets, and train a weak global CNN classifier;
(3) Classify the target domain data using the weak classifier to obtain the two target domain training sets;
(4) Train two DACNN models using the two source datasets and the target domain training sets to construct two strong partial classifiers;
(5) Design a particular ensemble strategy to combine the three classifiers and obtain the final classification results.
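Step (1) converts each time-domain sample into a frequency domain signal. The text does not state which transform is used; the sketch below assumes a single-sided amplitude spectrum obtained from a discrete Fourier transform, implemented naively so that it needs no external library. The function name and the test signal are hypothetical.

```python
import cmath
import math

def amplitude_spectrum(signal):
    """Single-sided amplitude spectrum of a real signal via a naive DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2):
        x_k = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        scale = 1.0 if k == 0 else 2.0  # the DC bin is not doubled
        spectrum.append(scale * abs(x_k) / n)
    return spectrum

# Toy signal: a unit-amplitude sine that completes 5 cycles in 64 points.
n_points = 64
toy = [math.sin(2 * math.pi * 5 * t / n_points) for t in range(n_points)]
spec = amplitude_spectrum(toy)
peak_bin = max(range(len(spec)), key=lambda k: spec[k])
```

For real recordings, a fast Fourier transform routine would normally replace the quadratic-time loop above; the naive form is used here only to keep the example self-contained.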

Experimental Verification
To validate the effectiveness of the proposed PT-ELF method, rotor and rolling bearing experiments are designed. Note that the code for the proposed method is written in PyTorch 1.2 and runs on a machine with 16 GB of RAM and a Core i5-10400F CPU.

Rotor Experiment
Case 1 adopts the rotor dataset from Northwestern Polytechnical University. As shown in Figure 7a, the experimental system is composed of a three-phase variable frequency motor, single-span rotor shafting, torque speed sensor, rolling bearing seat, shafting load plate, rubbing mounting bracket, platform bottom plate, radial loading device, coupling, system control cabinet, and fault suite. A displacement sensor is mounted on the rotor test bench to collect vertical vibration signals under a health state and six different fault states as shown in Figure 8, and the sampling frequency is 10,240 Hz. Figure 7b depicts the sensor and single-span rotor shaft layout. The structural components are listed in Table 2.
[Table 2. Structural components of the test bench; surviving entries: casing friction support and blade disc; 6, test bearing pedestal; 7, worm gear and worm.]
The rotor vibration data are collected under three working load conditions of 0%, 20%, and 40%. As detailed in Table 3, for each load, data from seven health states (including a health state and six fault states) are used. The data in each state are divided into 300 samples, with 80 randomly selected for testing and the remaining 220 used for training. Each sample consists of 800 data points and is used to verify the method proposed in this paper. Figure 9 shows the waveform of the original displacement signal and the spectral distribution of each health state under 0% load. The left shows the time-domain signal, and the right shows the corresponding spectrum. The signals have a large amplitude at around 10-30 Hz and show relatively similar characteristics, which makes it hard to recognize the health states.
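The sample preparation for one health state can be sketched as follows. The numbers (300 samples of 800 points, 80 for testing and 220 for training) come from the text; the non-overlapping segmentation, the fixed random seed, and the function names are assumptions made for this illustration.

```python
import random

def segment(signal, length=800):
    """Cut a long recording into non-overlapping samples of the given length (assumed scheme)."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, length)]

def split_state_samples(samples, n_test=80, seed=0):
    """Randomly select n_test samples for testing and keep the rest for training."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test

# Placeholder recording for one health state: 300 samples x 800 points.
recording = list(range(300 * 800))
samples = segment(recording)
train, test = split_state_samples(samples)
```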

Results and Discussion
In this case study, two incomplete source datasets are constructed, as shown in Table 4. The source dataset A contains five kinds of health states (states 1-5), and the source dataset B contains four kinds of health states (states 4-7).
First, the source domain datasets A and B are mixed to form a training set that contains all health states, which is used to train a weak classifier $C_W$. The classifier $C_W$ can classify all seven health states. Second, according to the classification results (the pseudo-labels) of the weak classifier $C_W$ on the target domain samples, two transfer models based on a DACNN are trained. They are transferred from source domain dataset A and source domain dataset B to the target domain. Thus, two strong classifiers $C_A$ and $C_B$ are trained. Finally, after classifying a test sample by the classifiers $C_A$, $C_B$, and $C_W$, three results are obtained and fused by the proposed ensemble strategy described in Section 3.3.
To demonstrate that our method is applicable to various operating conditions, five test scenarios (A1-E1) are designed, as listed in Table 5. The accuracies of the three classifiers (two strong partial classifiers and a weak global classifier) and of the proposed PT-ELF method in the five test scenarios are listed in Table 6, and a bar diagram is shown in Figure 10a. Note that the accuracy of $C_A$ is tested on states 1-5, and the accuracy of $C_B$ is tested on states 4-7. The weak classifier $C_W$ and the ensemble are tested on target domain test data that contain all of the health states (states 1-7).

It can be seen from Table 6 that the two strong classifiers $C_A$ and $C_B$ achieve high accuracy within their respective classification ranges, with averages of 93.29% and 96.83%. On the one hand, both are trained with a domain adversarial network (DACNN), which extracts domain-insensitive features for classification. On the other hand, they are tested on only part of the health states. The result of the weak classifier $C_W$ is relatively poor, with an average accuracy of 86.52%. This is because the target domain and the two source domains do not share the same data distribution, which degrades classification performance.
Across the five test scenarios, the accuracy of the proposed PT-ELF method is highest in scenario B1 at 95.41% and lowest in scenario C1 at 83.75%, with an average of 90.73%. This average is 4.21 percentage points higher than that of the weak classifier $C_W$ (86.52%). This is because the proposed ensemble strategy lets each test sample be classified by the corresponding strong classifier as far as possible. It indicates that our method can still achieve good results with incomplete training data.
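To make the idea of handing a test sample to the matching strong classifier concrete, the sketch below uses a simple routing rule. It is an illustrative assumption and not the ensemble strategy of Section 3.3: the weak classifier picks a coarse state, that state decides which partial classifier is consulted, and for shared states the more confident partial classifier wins. The function name `route_and_fuse` and all probability vectors are hypothetical.

```python
import numpy as np

STATES_A = [1, 2, 3, 4, 5]        # classification range of C_A
STATES_B = [4, 5, 6, 7]           # classification range of C_B
ALL_STATES = [1, 2, 3, 4, 5, 6, 7]  # classification range of C_W


def route_and_fuse(p_w, p_a, p_b):
    """Fuse the class probabilities of C_W, C_A and C_B for one test sample.

    p_w is indexed like ALL_STATES, p_a like STATES_A, p_b like STATES_B.
    Routing rule (an assumption for illustration only):
      1. C_W proposes a coarse state.
      2. If that state is covered by only one source, use that strong classifier.
      3. If it is covered by both, use the more confident strong classifier.
    """
    coarse = ALL_STATES[int(np.argmax(p_w))]
    in_a, in_b = coarse in STATES_A, coarse in STATES_B
    if in_a and in_b:
        if np.max(p_a) >= np.max(p_b):
            return STATES_A[int(np.argmax(p_a))]
        return STATES_B[int(np.argmax(p_b))]
    if in_a:
        return STATES_A[int(np.argmax(p_a))]
    return STATES_B[int(np.argmax(p_b))]


# Hypothetical outputs: C_W leans towards state 6, which only C_B covers.
p_w = np.array([0.05, 0.05, 0.05, 0.05, 0.05, 0.45, 0.30])
p_a = np.array([0.20, 0.20, 0.20, 0.20, 0.20])
p_b = np.array([0.10, 0.10, 0.10, 0.70])
final_state = route_and_fuse(p_w, p_a, p_b)
print(final_state)
```

In this example the weak classifier only selects which partial classifier to trust; the final label comes from the strong classifier, which is one way an ensemble can exceed the accuracy of $C_W$ alone.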
In addition, to demonstrate the superiority of our method, a CNN and a DACNN, each trained on source dataset A or source dataset B alone, are used as comparison methods (methods 1-4). The results are listed in Table 7, and a bar diagram of the various methods is shown in Figure 10b. The average accuracies of the CNN trained on source domains A and B are 58.87% and 55.27%, respectively. The average accuracies of the DACNN trained on source domains A and B are 64.02% and 56.79%, respectively, which are 5.15 and 1.52 percentage points higher than those of the corresponding CNN. This is because the DACNN extracts domain-insensitive features through adversarial training, which limits the performance drop caused by the distribution discrepancy and improves accuracy in the target domain. However, since source domain A is incomplete, a model (CNN or DACNN) trained on source dataset A cannot classify test samples whose actual label is in $\hat{D}_B$ (states 6-7). Similarly, a model (CNN or DACNN) trained on source dataset B cannot classify test samples whose actual label is in $\hat{D}_A$ (states 1-3). Therefore, the results of methods 1-4 are poor compared to our method. The average accuracy of our method is 90.73%, which indicates that the proposed method has good classification ability for all health states present in the testing dataset of the target domain.
The rolling bearing vibration data used in Case 2 are from Case Western Reserve University [32]. As shown in Figure 11, the setup mainly consists of a loading motor, an induction motor, and the test bearings. The vibration signals used in this case are collected by an accelerometer installed near the drive end. As listed in Table 8, the vibration signals were collected under four different loads (Load 1-Load 4).
Each fault was artificially implanted into the bearings with severity levels from 0.007 to 0.028 inches in diameter (1 inch = 25.4 mm). The details of the test bearing are listed in Table 9.
Figure 11. The experimental setup of the rolling bearing.
The vibration data collected under the four loads are used to test the proposed method. Each load condition includes 12 health states, which cover different failure locations (shown in Figure 12), failure orientations, and failure severities. As detailed in Table 10, each health state contains 300 samples, each consisting of 400 consecutive data points. Of these, 200 randomly selected samples are used for training, and the remaining 100 are used for testing. The raw vibration signals at 1797 rpm (0 hp) (left column) and the corresponding spectral distributions (right column) are shown in Figure 13. In the raw vibration signals, the amplitude in the healthy state is relatively small (Figure 13a), whereas the fault signals (Figure 13b-i) show obvious impacts. The spectral distributions contain the fault frequency and the bearing natural frequency. Except for the healthy signal, the fault vibration signals have a higher amplitude at around 3-4 kHz. Nevertheless, it remains very difficult to accurately distinguish the fault location, dimension, and orientation across different working conditions with new fault states.
The proposed method mainly studies the case in which labeled data are available for only part of the health states in the source domain. To verify our method, we assume that source domain dataset A contains labeled data for only eight kinds of health states, while source domain dataset B contains labeled data for seven kinds. Among them, three categories overlap, as shown in Table 11. In addition, all target domain data are unlabeled and contain 12 kinds of health states.
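The sample construction described above (300 samples of 400 points per health state, split at random into 200 training and 100 test samples) and the spectral view of a sample can be sketched as follows. This is a minimal illustration under stated assumptions: the windows are taken as non-overlapping, the sampling rate `fs` of 12 kHz is assumed because the text does not state it, and the synthetic 3 kHz signal merely stands in for a real vibration record.

```python
import numpy as np

POINTS_PER_SAMPLE = 400   # data points per sample (from the text)
SAMPLES_PER_STATE = 300   # samples per health state (from the text)
N_TRAIN = 200             # training samples per state; the other 100 are test


def make_samples(signal, rng):
    """Cut one record into 300 windows of 400 points and split them 200/100.

    Assumption: windows are consecutive and non-overlapping.
    """
    needed = POINTS_PER_SAMPLE * SAMPLES_PER_STATE
    if signal.size < needed:
        raise ValueError(f"record needs at least {needed} points")
    samples = signal[:needed].reshape(SAMPLES_PER_STATE, POINTS_PER_SAMPLE)
    order = rng.permutation(SAMPLES_PER_STATE)  # random train/test assignment
    return samples[order[:N_TRAIN]], samples[order[N_TRAIN:]]


def amplitude_spectrum(sample, fs):
    """Return the frequency axis (Hz) and magnitude spectrum of one sample."""
    freqs = np.fft.rfftfreq(sample.size, d=1.0 / fs)
    amplitude = np.abs(np.fft.rfft(sample)) / sample.size
    return freqs, amplitude


rng = np.random.default_rng(0)
fs = 12_000  # Hz, assumed sampling rate
t = np.arange(POINTS_PER_SAMPLE * SAMPLES_PER_STATE) / fs
# Synthetic stand-in for a fault signal with energy near 3 kHz.
signal = np.sin(2 * np.pi * 3000 * t) + 0.1 * rng.standard_normal(t.size)

train, test = make_samples(signal, rng)
freqs, amplitude = amplitude_spectrum(train[0], fs)
peak_hz = freqs[np.argmax(amplitude)]
print(train.shape, test.shape, peak_hz)
```

With 400-point windows, the frequency resolution is `fs / 400`, so short samples trade spectral detail for a larger number of training examples.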

Results and Discussion
Similar to Case 1, source datasets A and B are first mixed to form a training set containing all health states, which is used to train the weak classifier $C_W$. Thus, $C_W$ can classify all of the health states, but its classification ability is weak.
Next, two DACNN models are trained on source domain datasets A and B to adapt to the target domain data, which yields the two strong classifiers $C_A$ and $C_B$. In each DACNN, the feature extractor $G_f$ contains two convolution blocks, and the classifier $G_y$ contains a fully connected layer followed by a Softmax output. The composition $G_y(G_f(x))$ in the DACNN is taken as the classifier. Finally, the three trained classifiers $C_A$, $C_B$, and $C_W$, which have different classification capabilities and classification ranges, are integrated using the ensemble strategy introduced in Section 3.3 to obtain the final diagnosis result.
To demonstrate that our method is applicable to different working conditions, five test scenarios (A2-E2) with incomplete data are used, as shown in Table 12. Source dataset A provides labeled samples of eight health states (states 1-8), and source dataset B provides labeled samples of seven health states (states 6-12). The target data, which contain unlabeled samples of all 12 health states (states 1-12), are used for testing. In the five test scenarios, source domain datasets A and B and the target domain dataset are taken from data collected under different loads.
To show the superiority of our method, two conventional deep learning methods based on a CNN (methods 1 and 2) and two transfer learning methods based on a DACNN (methods 3 and 4) are used for comparison in the five test scenarios; the results are listed in Table 13. Methods 1 and 3 are trained on source dataset A, and methods 2 and 4 are trained on source dataset B. A bar diagram of the results of the different methods is shown in Figure 14.
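The classifier structure described above, a feature extractor $G_f$ with two convolution blocks followed by a classifier $G_y$ with one fully connected layer and a Softmax output, can be sketched as a forward pass in NumPy. This is only a shape-level illustration with untrained random weights: the kernel size, channel counts, and pooling factor are assumptions, each convolution block is assumed to be convolution, ReLU, and max pooling, and the domain discriminator and adversarial training of the DACNN are omitted.

```python
import numpy as np


def conv_block(x, w, b, pool=2):
    """Valid 1-D convolution, ReLU, then non-overlapping max pooling.

    x: (c_in, length), w: (c_out, c_in, k), b: (c_out,)
    """
    c_out, _, k = w.shape
    # windows has shape (c_in, length - k + 1, k)
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=1)
    y = np.einsum("ilk,oik->ol", windows, w) + b[:, None]
    y = np.maximum(y, 0.0)
    usable = (y.shape[1] // pool) * pool  # drop a trailing element if needed
    return y[:, :usable].reshape(c_out, -1, pool).max(axis=2)


def init_params(rng, n_classes, length=400, k=9, c1=8, c2=16, pool=2):
    """Random weights; all sizes except length=400 are assumed values."""
    len1 = (length - k + 1) // pool
    len2 = (len1 - k + 1) // pool
    return {
        "w1": 0.1 * rng.standard_normal((c1, 1, k)),
        "b1": np.zeros(c1),
        "w2": 0.1 * rng.standard_normal((c2, c1, k)),
        "b2": np.zeros(c2),
        "w_fc": 0.01 * rng.standard_normal((n_classes, c2 * len2)),
        "b_fc": np.zeros(n_classes),
    }


def g_f(x, p):
    """Feature extractor G_f: two convolution blocks, flattened output."""
    h = conv_block(x[None, :], p["w1"], p["b1"])
    h = conv_block(h, p["w2"], p["b2"])
    return h.ravel()


def g_y(features, p):
    """Classifier G_y: one fully connected layer followed by Softmax."""
    logits = p["w_fc"] @ features + p["b_fc"]
    logits = logits - logits.max()  # numerical stability
    e = np.exp(logits)
    return e / e.sum()


rng = np.random.default_rng(0)
params_a = init_params(rng, n_classes=8)  # C_A covers states 1-8
x = rng.standard_normal(400)              # one 400-point sample
features = g_f(x, params_a)
probs = g_y(features, params_a)           # G_y(G_f(x))
print(features.shape, probs.shape, float(probs.sum()))
```

A classifier for source dataset B would use `n_classes=7`, which is the only structural difference between the two partial classifiers in this sketch.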

As shown in Table 13 and Figure 14, the average accuracies of methods 1 and 2 are 64.27% and 57.53%, respectively. The average accuracies of methods 3 and 4, which are based on transfer learning, are slightly higher at 66.22% and 58.05%, respectively. This is because the DACNN handles cross-domain fault diagnosis well and improves recognition accuracy in the target domain. However, since neither source dataset A nor source dataset B contains all of the health states present in the testing data, the fault classification accuracy remains relatively low even when the transfer strategy is used. The proposed method achieves accuracies of 98.08%, 95.41%, 99.66%, 99.25%, and 95.83% in the five test scenarios, respectively. The accuracy is lowest in test scenario B2, where it still reaches 95.41%, and highest in test scenario C2 at 99.66%. The comparison results demonstrate that the proposed PT-ELF method exhibits satisfactory cross-domain diagnostic ability with new health states.
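The low accuracies of methods 1-4 can be checked against a simple coverage bound: a classifier that can only output k of the n health states in the test set cannot exceed k/n accuracy when the test set is balanced. The sketch below computes this bound for both case studies and compares it with the reported DACNN averages. The balance assumption holds for Case 2 (100 test samples per state) and is only assumed for Case 1; the function name `coverage_bound` is hypothetical.

```python
from fractions import Fraction


def coverage_bound(n_source_states, n_target_states):
    """Upper bound on accuracy for a classifier limited to the source states.

    Assumes every target health state has the same number of test samples.
    """
    return Fraction(n_source_states, n_target_states)


# (source states, target states, reported DACNN average accuracy in %)
cases = {
    "Case 1, source A": (5, 7, 64.02),
    "Case 1, source B": (4, 7, 56.79),
    "Case 2, source A": (8, 12, 66.22),
    "Case 2, source B": (7, 12, 58.05),
}

bounds = {}
for name, (k, n, reported) in cases.items():
    bounds[name] = 100 * float(coverage_bound(k, n))
    print(f"{name}: bound {bounds[name]:.2f}%, reported {reported:.2f}%")
```

Under this assumption the bounds are about 71.43% and 57.14% for Case 1 and 66.67% and 58.33% for Case 2, so every single-source result reported above stays below its bound, and only a method that covers all health states can reach the accuracies above 90% obtained by PT-ELF.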

Conclusions
This paper proposes a rotating machinery fault diagnosis method based on partial transfer learning and ensemble learning. Unlike existing cross-domain diagnostic methods, which assume the same health states in the source and target domains, the proposed method can provide a reliable diagnosis result in the target domain even when the source domain is incomplete and contains only part of the health states. As the core of the proposed method, partial transfer learning avoids the problem induced by incomplete training data and trains two classifiers with strong classification capabilities for partial categories. Then, a particular ensemble strategy is designed to combine the outputs of the three classifiers (a weak global classifier and two strong partial classifiers). The effectiveness of the proposed method is validated on a rotor experiment and a bearing experiment. Comparison with four related methods indicates that the proposed method achieves superior performance and provides a reliable diagnosis result based on an incomplete source domain under various working conditions.
In this preliminary study, the proposed method relies on the assumption that the health states missing from the source domain training set can be obtained from another dataset or another component. Unseen health states will be considered in our future research.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that relate to the work reported in this paper.