Semi-Supervised Transfer Learning Method for Bearing Fault Diagnosis with Imbalanced Data

Abstract: Fault diagnosis is essential for assuring the safety and dependability of rotating machinery systems. Several emerging techniques, especially artificial intelligence-based technologies, are used to overcome the difficulties in this field. In most engineering scenarios, machines operate in normal conditions, which implies that fault data may be limited and hard to acquire. Therefore, data imbalance and the deficiency of labels are practical challenges in the fault diagnosis of machinery bearings. Among the mainstream methods, transfer learning-based fault diagnosis is highly effective, as it transfers the results of previous studies and integrates existing resources. In this work, knowledge from the source domain is transferred via Domain Adversarial Training of Neural Networks (DANN) while the dataset of the target domain is partially labeled. A semi-supervised framework based on uncertainty-aware pseudo-label selection (UPS) is adopted in parallel to improve the model performance by utilizing abundant unlabeled data. In experiments on two bearing datasets, the bearing fault classification accuracy of the proposed method surpassed that of the independent approaches.


Introduction
With the progress of industrialization, rotating machinery has become increasingly important and is widely used in industrial applications. However, rotating machinery often operates under harsh working conditions, which cause it to degrade and reduce its service performance [1]. In particular, bearing faults account for almost 30% of the faults in rotating machinery [2]. Although traditional fault diagnosis methods based on engineers' experience and domain-specific knowledge have shown good performance, rotating machinery has become increasingly sophisticated in recent years, making faults more difficult to diagnose. Moreover, manual fault diagnosis is laborious and time-consuming. Since intelligent diagnostic methods have recently emerged as one of the most advanced approaches to these issues, intelligent fault diagnosis is a promising choice and worthy of research [3]. In practice, signal analysis-based fault diagnosis is most commonly implemented by extracting and classifying the main features using data preprocessing and classification algorithms [4].
Many artificial intelligence techniques were applied in practical scenarios of industrial manufacturing, including k-nearest neighbor (K-NN) algorithms [5], Bayesian classifiers [6], support vector machines (SVMs) [7], artificial neural networks (ANNs) [8], and deep learning approaches most recently [9]. Among them, the convolutional neural network (CNN) [10] showed outstanding performance in transfer learning-based fault diagnosis. In 2017, You et al. [11] proposed a CNN combined with support vector regression (SVR) which achieved accuracies of 93.9% and 97.6% for two separate datasets. For most classification models that only use a single dataset, the architecture of a CNN for feature extraction and other artificial intelligence methods for classification can provide high accuracy.
Many successful applications of machine learning algorithms rely on the precondition that a large amount of labeled training data is available and that the training and testing data follow the same distribution. The imbalanced and limited data collected from practical systems may lead to low classification accuracies in bearing fault diagnosis for traditional machine learning methods. Meanwhile, an established machine learning model may become unsuitable when dealing with newly acquired data, since such data may not follow the same distribution as the training dataset. Moreover, a real-world challenge in bearing fault diagnosis is the frequent lack of labels and of fault data. In this situation, semi-supervised learning (SSL) can help to alleviate the difficulty because it requires only a small amount of labeled data [12,13]. SSL is a machine learning task between supervised learning and unsupervised learning. Consistency regularization and pseudo-labeling are the two dominant approaches in SSL, and both generally presume that decision boundaries lie in low-density regions [14]. Compared with consistency regularization, which often requires numerous augmentation operations, pseudo-labeling can be used in most domains with high accuracy. The key idea of pseudo-labeling is to select the unlabeled instances that are predicted with high confidence and to use them as labeled training data in the next iteration.
In some practical applications, even unlabeled data from the same domain can be challenging to obtain. Therefore, transfer learning is a promising technique for overcoming the challenge outlined above, as it is based on transferring knowledge across domains [15,16]. Transfer learning aims to increase model accuracy or reduce the number of labeled samples required in the target domain by leveraging knowledge from the source domain [17,18]. In the area of transfer learning-based fault diagnosis, the feature spaces of the source and target domains are usually adapted using the maximum mean discrepancy (MMD) distance [19,20]. According to a review by Pan and Yang [21], the basic approaches to transfer learning can be divided into four categories: instance transfer, feature representation transfer, parameter transfer, and relational knowledge transfer. Moreover, the increasing popularization of deep neural networks has prompted researchers to apply them to transfer learning.
In the beginning, most of the methodologies were based on pretrained recurrent neural networks [22]. When the generative adversarial network (GAN) approach was first applied to transfer problems, it immediately attracted attention for its remarkable performance. Ganin et al. [23] first introduced an adversarial mechanism into the training of neural networks, known as domain-adversarial neural networks (DANNs). In that study, the domain discriminator is trained to distinguish between the two domains as well as possible, while the feature extractor is trained to prevent the discriminator from telling the two domains apart. In 2019, Yu et al. [24] extended the concept of dynamic distribution adaptation to GANs and presented dynamic adversarial adaptation networks (DAANs) to solve the issue of mismatched contributions of the marginal (global) and conditional (local) distributions between domains. Figure 1 illustrates the different effects of marginal and conditional distributions in transfer learning applications. The marginal distribution has more influence when two domains are substantially distinct (Source vs. Target I). In contrast, the conditional distribution should be prioritized when the global distributions are closer (Source vs. Target II). However, to guarantee the success of such domain adaptation methods, there should be abundant labeled data in the domains, which is often impractical in actual working conditions. Collecting enough data also increases the cost of time and effort in fault diagnosis. The data imbalance in diagnosing machinery bearing faults can be outlined in two aspects: the imbalance between normal and abnormal samples, and the insufficient amount of data in settings with different specified external or internal operating parameters. Consequently, the proposed method aims to solve these two problems through transfer semi-supervised learning.
More specifically, semi-supervised learning addresses the first issue through pseudo-labeling, whereas transfer learning addresses the second by transferring knowledge from a different dataset [25,26].
As shown in Figure 2, traditional pseudo-labeling usually involves feeding a small amount of labeled data into the model for initial training and then feeding the unlabeled data into the model for classification [14]. When the predicted confidence that a sample belongs to a class exceeds a predetermined threshold, the sample is given the corresponding pseudo-label. Alternatively, the class with the maximum predicted confidence is directly selected as the pseudo-label. The pseudo-labeled sample is then added to the original training dataset as if it were labeled, and the model is retrained. However, this approach often suffers from the problem that pseudo-labels can have a high confidence level regardless of whether the samples are correctly labeled. If many unlabeled samples are mislabeled and used for training, the training set will contain many noisy samples, which degrades performance significantly. It is therefore not sufficient to use the confidence of the softmax layer as the only basis for filtering. Uncertainty-aware pseudo-label selection (UPS) is an effective semi-supervised learning framework that introduces negative learning and uncertainty estimation with expected calibration error (ECE) into the conventional pseudo-labeling method [14,28]. In many tasks, its performance surpasses that of consistency regularization, the other primary SSL method.
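The conventional confidence-threshold selection described above can be sketched as follows. This is only an illustration, not the authors' implementation: the function name `select_pseudo_labels`, the threshold of 0.9, and the toy probabilities are all assumptions.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Conventional pseudo-labeling: keep unlabeled samples whose highest
    softmax probability reaches `threshold` (an illustrative value)."""
    confidence = probs.max(axis=1)       # top class probability per sample
    labels = probs.argmax(axis=1)        # class with maximum confidence
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]

# Toy softmax outputs for three unlabeled samples and three classes.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.40, 0.35, 0.25],
                  [0.05, 0.92, 0.03]])
idx, pseudo = select_pseudo_labels(probs, threshold=0.9)
print(idx, pseudo)
```

Note that a sample is kept purely because its softmax confidence is high, which is exactly the weakness discussed above: a confidently wrong prediction passes the filter unchanged.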
In summary, a novel framework, the uncertainty-aware pseudo-label selection (UPS) model with a DANN, is proposed based on the concept of a semi-supervised learning generative adversarial network to overcome the aforementioned problems of imbalanced data. The main contributions of this paper are as follows:

1. A hybrid UPS model with a DANN is proposed with a variable ratio to improve accuracy and robustness;
2. Unlabeled data are labeled with pseudo-labels to enlarge the labeled target dataset;
3. The proposed method is successfully verified in the analysis of the bearing fault diagnosis task on the Case Western Reserve University (CWRU) dataset and Xi'an Jiaotong University-Sumyoung (XJTU-SY) dataset, where the diagnosis accuracy is proven to be higher than other well-known fault diagnosis methods.
The structure of this paper is as follows. Section 2 introduces the data preprocessing and the proposed method based on UPS and a DANN. Section 3 presents the experiments and illustrates the results by comparing them with independent approaches. Section 4 concludes the paper.

Data Preprocessing
The short-time Fourier transform (STFT) plays a significant role in preprocessing the raw signal data. The Fourier transform is a traditional method for transforming a time domain signal into a frequency domain signal, but it has the limitation that it provides no temporal resolution. The STFT applies a window to the signal and shifts it, which gives a fixed temporal resolution and constructs the spectrogram used as the subsequent data input. The main formula of the STFT is

$$X(n_0, \omega) = \sum_{n=-\infty}^{\infty} x(n)\, w(n - n_0)\, e^{-j\omega n}$$

where $x(n)$ is the discrete signal sequence, $w(n)$ is the analysis window, $n_0$ is the window center, and $\omega$ is a continuous variable denoting frequency. $X(n_0, \omega)$ is a function of frequency for the time section $n_0$. The window then slides by the hop size $s$ to obtain $X(n_0 + s, \omega)$, the STFT result of the next section. Lastly, the frequency results are combined in chronological order to form a complete spectrogram. The window function used here is a Hann window.
The network is evaluated by transferring from the CWRU dataset [29] to the XJTU-SY dataset [30]. Figure 3 shows some examples of the CWRU data. Bearing faults in the CWRU dataset are artificially created in different areas of the bearing, and the data are recorded at different sampling rates. Therefore, the vibration amplitude tends to be regular and periodic over time.
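To make the windowing and hopping concrete, the following is a minimal NumPy sketch of a Hann-windowed magnitude spectrogram. The window length (256), hop size (128), sampling rate (12 kHz), and the synthetic 1500 Hz test tone are illustrative assumptions and are not taken from the paper. For simplicity, each frame is indexed by its start sample rather than by the window center.

```python
import numpy as np

def stft_spectrogram(x, win_len=256, hop=128):
    """Magnitude STFT with a Hann window.
    Returns an array of shape (n_frames, win_len // 2 + 1)."""
    window = np.hanning(win_len)                      # Hann window
    n_frames = 1 + (len(x) - win_len) // hop          # full frames only
    frames = np.stack([x[k * hop: k * hop + win_len] * window
                       for k in range(n_frames)])     # slide by the hop size
    return np.abs(np.fft.rfft(frames, axis=1))        # one spectrum per frame

fs = 12_000                                # assumed sampling rate in Hz
t = np.arange(2000) / fs                   # one 2000-point segment
signal = np.sin(2 * np.pi * 1500 * t)      # synthetic 1500 Hz tone
spec = stft_spectrogram(signal)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * fs / 256              # bin index -> frequency in Hz
print(spec.shape, peak_hz)
```

Stacking the per-frame spectra in chronological order gives the spectrogram; in this toy case the energy concentrates in the frequency bin corresponding to the test tone.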
In contrast, the XJTU-SY dataset contains the full life cycle of bearing degeneration. As shown in Figure 4, which gives an example from XJTU-SY with obvious transition characteristics, the whole process can be split into three phases by observing the sudden changes between them. The first phase is the normal vibration data of the bearing, so the amplitude usually stabilizes within a low-value range. The second phase is the vibration data when the bearing starts to degenerate. During this phase, the amplitude fluctuates more heavily and sometimes gradually increases over time. The third phase is the vibration data when the bearing is completely damaged. As a result, the amplitude continues to rise more markedly, eventually reaching a very high level. However, as the cases in Figure 5 show, the degenerative process is gradual for some bearings and sharp for others. The degeneration in the second phase may not be evident and can therefore be ignored, which allows the analysis to focus on the first and third phases. The data in the first phase are labeled as normal data, and the data in the third phase are labeled as fault data.
In conclusion, the two datasets have both commonalities and differences, which is why the transfer from CWRU to XJTU-SY was chosen. The common elements are that both consist of bearing vibration data and share some of the same fault classes. Compared with the CWRU dataset, the XJTU-SY dataset is closer to actual working conditions, where bearings gradually degenerate, but its drawback is the small amount of data. The CWRU dataset, by contrast, contains a larger amount of data, but the data are recorded in a different environment, and the bearing faults are artificially created. Therefore, knowledge learned from the CWRU dataset was transferred to the XJTU-SY domain to address the data imbalance problem. Table 1 details the differences between the CWRU and XJTU-SY datasets from six perspectives.
Negative learning (NL), proposed in [31], is used mainly to obtain a good initialization of the network for learning with noisy labels. In that approach, a network is first trained with randomly generated negative labels (NL step) and is then used to selectively generate negative labels based on confidence scores (SelNL). The selective positive learning (SelPL) used there also relies on creating positive pseudo-labels based on confidence. Different from the NL in [31], the NL in UPS is designed to include additional unlabeled samples in the training phase and to generalize pseudo-labeling to multi-label classification settings. In a trained network, the sample $x^{(i)}$ yields the output probability $p^{(i)}$, and $p_c^{(i)}$ refers to the probability of class $c$. Similar to one-hot encoding in traditional multi-class problems, it can be converted to a $1 \times C$-dimensional label. The pseudo-labels $\tilde{y}_c^{(i)}$ of sample $x^{(i)}$ are therefore computed as follows:

$$\tilde{y}_c^{(i)} = \mathbb{1}\left[p_c^{(i)} \geq \gamma\right]$$

where $\gamma \in (0, 1)$ is the threshold for labels, which is hard to determine. The binary vector $g^{(i)}$ represents the selected pseudo-labels as

$$g_c^{(i)} = \mathbb{1}\left[p_c^{(i)} \geq \tau_p\right] + \mathbb{1}\left[p_c^{(i)} \leq \tau_n\right]$$

where $\tau_p$ is the confidence threshold for positive labels, and $\tau_n$ is the confidence threshold for negative ones. For single-label classification, the cross-entropy loss is computed for the samples with selected positive pseudo-labels. When no positive label is selected, negative learning is used with the negative cross-entropy loss. The expression of negative learning is

$$L_{\mathrm{NCE}}\left(\tilde{y}^{(i)}, p^{(i)}, g^{(i)}\right) = -\frac{1}{s^{(i)}} \sum_{c=1}^{C} g_c^{(i)} \left(1 - \tilde{y}_c^{(i)}\right) \log\left(1 - p_c^{(i)}\right)$$

where $s^{(i)} = \sum_{c} g_c^{(i)}$ is the number of selected pseudo-labels for sample $i$. As a result, even if the model is not confident enough that a sample belongs to any class, it can still improve the diagnostic accuracy by learning from the classes with very low predicted probability, to which the sample most probably does not belong.
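As a hedged illustration of the confidence-based selection and the negative cross-entropy described above, the sketch below evaluates both on toy probabilities. The helper names `select_mask` and `negative_ce`, the thresholds (`tau_p = 0.7`, `tau_n = 0.05`, `gamma = 0.5`), and the example values are assumptions chosen for illustration only.

```python
import numpy as np

def select_mask(probs, tau_p=0.7, tau_n=0.05):
    """g[i, c] = 1 when class c is selected as a positive pseudo-label
    (p >= tau_p) or as a negative pseudo-label (p <= tau_n)."""
    return ((probs >= tau_p) | (probs <= tau_n)).astype(float)

def negative_ce(probs, pseudo, g, eps=1e-12):
    """Negative cross-entropy per sample, averaged over the s selected labels.
    Only selected classes with pseudo-label 0 contribute."""
    s = g.sum(axis=1)                                  # number of selected labels
    terms = g * (1.0 - pseudo) * np.log(1.0 - probs + eps)
    return -terms.sum(axis=1) / np.maximum(s, 1.0)     # guard against s = 0

probs = np.array([[0.55, 0.42, 0.03],    # no confident positive class
                  [0.90, 0.06, 0.04]])   # confident positive for class 0
gamma = 0.5                               # illustrative label threshold
pseudo = (probs >= gamma).astype(float)   # pseudo-labels from thresholding
g = select_mask(probs)
loss = negative_ce(probs, pseudo, g)
print(g, loss)
```

In the first sample no class is confident enough to act as a positive label, yet the third class is confidently ruled out, so the sample still contributes a training signal through the negative term.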

Uncertainty Estimation
Experimental results in [14] show that the ECE score is positively correlated with the prediction uncertainty, which implies that predictions with lower uncertainty tend to be better calibrated. The uncertainty of the output value can therefore be used as an alternative confidence level for selecting reliable pseudo-labeled samples. ECE is a standard metric for evaluating the calibration capability of a classifier, which can be obtained as follows:

$$\mathrm{ECE} = \sum_{l=1}^{L} \frac{1}{|D|} \left| \sum_{x^{(i)} \in I_l} \max_c p_c^{(i)} - \sum_{x^{(i)} \in I_l} \mathbb{1}\left[\arg\max_c p_c^{(i)} = \arg\max_c \tilde{y}_c^{(i)}\right] \right|$$

where the confidence predictions on dataset $D$ are divided into $L$ evenly spaced bins, and the samples in a particular bin $l$ are referred to as $I_l$.
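The ECE computation can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions: the function name, the choice of 10 bins, and the toy predictions are hypothetical, and `labels` stands for the reference labels against which predictions are compared.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: for each confidence bin, compare summed confidence with the
    number of correct predictions, then normalize by the dataset size."""
    conf = probs.max(axis=1)                          # confidence per sample
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)         # evenly spaced bins
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += abs(conf[in_bin].sum() - correct[in_bin].sum()) / len(labels)
    return ece

probs = np.array([[0.9, 0.1], [0.9, 0.1], [0.6, 0.4], [0.6, 0.4]])
labels = np.array([0, 1, 0, 0])
ece = expected_calibration_error(probs, labels)
print(ece)
```

In this toy case the 0.9-confidence predictions are right only half of the time (overconfident) while the 0.6-confidence predictions are always right (underconfident), and both gaps contribute to the score.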
Hence, a more reliable subset of pseudo-labels is used in training by considering both the confidence and the uncertainty of a network prediction. Now, Equation (2) can be reformulated as

$$g_c^{(i)} = \mathbb{1}\left[u\left(p_c^{(i)}\right) \leq \kappa_p\right] \mathbb{1}\left[p_c^{(i)} \geq \tau_p\right] + \mathbb{1}\left[u\left(p_c^{(i)}\right) \leq \kappa_n\right] \mathbb{1}\left[p_c^{(i)} \leq \tau_n\right]$$

where $g_c^{(i)}$ indicates whether the pseudo-label of class $c$ is selected for sample $i$, $u(p)$ is the uncertainty of a prediction $p$, and $\kappa_p$ and $\kappa_n$ are the uncertainty thresholds. Figure 6 illustrates the proposed network in this paper: a deep neural network combining a DANN and UPS, where the DANN supports UPS in filtering pseudo-labels more robustly, and the new labels expand the labeled target data to bring the distributions of the source and target domains closer. The DANN model takes all input data for training. The UPS model takes data from the target domain for the pseudo-label selection with uncertainty awareness.
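As a rough illustration of uncertainty-aware selection, the sketch below adds an uncertainty gate to the confidence thresholds. The text does not specify how $u(p)$ is computed, so this example assumes it is the standard deviation of the class probabilities over several stochastic forward passes (for example, with dropout left active); the function name `ups_select` and all threshold values are likewise assumptions.

```python
import numpy as np

def ups_select(mc_probs, tau_p=0.7, tau_n=0.05, kappa_p=0.05, kappa_n=0.005):
    """mc_probs: (T, N, C) class probabilities from T stochastic passes.
    A label is selected only if it is both confident and low-uncertainty."""
    p = mc_probs.mean(axis=0)                 # mean prediction per class
    u = mc_probs.std(axis=0)                  # assumed uncertainty estimate
    positive = (u <= kappa_p) & (p >= tau_p)  # certain and confidently present
    negative = (u <= kappa_n) & (p <= tau_n)  # certain and confidently absent
    return (positive | negative).astype(int), p, u

mc_probs = np.array([
    [[0.90, 0.10], [0.95, 0.05]],   # stochastic pass 1
    [[0.92, 0.08], [0.55, 0.45]],   # stochastic pass 2
])
g, p, u = ups_select(mc_probs)
print(g)
```

The second sample has a mean confidence of 0.75 for class 0 and would pass a confidence-only filter, but its predictions disagree strongly between passes, so the uncertainty gate rejects it.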

Model Structure
The parameter $\alpha$ is adaptive and can be learned by gradient descent. $CE_d$ and $CE_s$ are the cross-entropy losses of the DANN and UPS, respectively, and $CE_{pl}$ is the cross-entropy of the pseudo-labels. They are weighted by $\alpha$ before the cross-entropy layer and then passed on to select high-confidence instances as new samples of labeled target data.

Experiment Set-Up
The data from the CWRU dataset are the labeled source data. The data from the XJTU-SY dataset, as the target domain, consist of only a small amount of labeled data and a large amount of unlabeled data for transfer semi-supervised learning. Each data record was split into 240 segments of 2000 data points each. The total data size for the CWRU dataset was 480,000 points for each class, and the data size for the XJTU-SY dataset was 672,000. Table 2 shows the selection of data used for training in detail.
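The segmentation step can be sketched as below, assuming non-overlapping segments (the text does not state whether segments overlap). The function name `segment` and the random stand-in signal are hypothetical; only the segment length of 2000 points and the total of 480,000 points come from the text.

```python
import numpy as np

def segment(signal, seg_len=2000):
    """Split a 1-D vibration record into non-overlapping segments of
    `seg_len` points, dropping any incomplete tail."""
    n_seg = len(signal) // seg_len
    return signal[: n_seg * seg_len].reshape(n_seg, seg_len)

# Stand-in for one class of CWRU data: 480,000 points of synthetic noise.
raw = np.random.default_rng(0).standard_normal(480_000)
segments = segment(raw)
print(segments.shape)
```

With 480,000 points and 2000 points per segment, this yields the 240 segments per class mentioned above.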
The source dataset, created by the Bearing Data Center of Case Western Reserve University (CWRU), is the most widely cited standard dataset for current research on signal processing and fault diagnosis of bearing vibration [29]. It is also considered the primary dataset for training network models and testing network performance. Electro-discharge machining (EDM) was used to artificially induce single-point faults in the test bearings with fault diameters of 7 mils, 14 mils, and 21 mils. Each class of fault diameter was introduced separately at the inner race, ball, and outer race [32]. By changing the bearing diameter, fault location, motor load and speed, and sampling frequency, the experiment generated a variety of valid data on a limited number of practical machines. Considering the balance of data and the fault elements shared with the XJTU-SY dataset, three class labels were selected: normal bearings, inner fault bearings, and outer fault bearings (Table 3).
The target XJTU-SY bearing dataset [30] was acquired from accelerated degeneration experiments of rolling element bearings with 15 bearings under 3 operating conditions. Due to the different working conditions of different bearings, the service life of the bearings can also vary significantly, which means the data are highly imbalanced. The fault elements include single and multiple points, specifically the inner race, outer race, and cage for a different bearing lifetime. Based on the different degeneration performances, 8 bearings from the XJTU-SY dataset were selected and split accordingly as the target domain and the same three classes of data from the CWRU dataset as the source domain (Table 4).

Results and Discussion
In general, the overall accuracy of the proposed method on the test dataset can reach up to 99% after 50 epochs (Figure 7), which indicates the ability to transfer the model. The accuracy here is defined as

$$\mathrm{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$

It is interesting to note that the test accuracy was even higher than the training accuracy during the first 10 epochs. Since there was no data leakage in the validation set, and the split of the training and test data was completely random, it is speculated that the noise in the training set was greater than that in the validation set. The data augmentation by pseudo-labeling made the training data more complex than the test data, and the model was not able to fully memorize the training data. Table 5 lists the performances of the proposed method and other methods. Baseline refers to predicting the test data of the XJTU-SY dataset directly with the model trained on the CWRU dataset. This model has no transfer learning to bring the distributions of the source and target domains closer and no pseudo-labeling to extend the imbalanced training data. Consequently, the baseline only achieved a test accuracy of 23%. When transfer learning alone was used to transfer from the CWRU dataset to the XJTU-SY dataset, the DAAN and the DANN provided average test accuracies of 42% and 56%, respectively. This performance was roughly twice as high as that of the baseline model, which means that both popular transfer learning models can improve the accuracy considerably. When only semi-supervised learning was used, with UPS training on the pseudo-labels, the average test accuracy increased to 76%. This indicates the remarkable power of uncertainty-aware pseudo-labeling and demonstrates the ability of semi-supervised learning to resolve the deficiency of labeled data.
Ultimately, when transfer learning and semi-supervised learning were combined, both combinations improved on the accuracy of UPS alone, but the final proposed model, UPS + DANN, showed a greater average test accuracy of 96% compared with UPS + DAAN at 90%. Figure 8 depicts the confusion matrices of six different models, with the rows indicating true labels and the columns indicating predicted labels. Each cell in the confusion matrix shows the percentage of samples for that combination of true and predicted class. Among the three classes, the precision of the outer race class was the highest in most models, and UPS performed poorly on the normal class in particular. Moreover, the performance of the DANN was more evenly balanced across classes than that of the DAAN. It can also be observed clearly that UPS + DANN preserved a relatively high accuracy, especially in the inner race and outer race classes.
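For readers who want to reproduce this kind of evaluation, the following sketch builds a confusion matrix with rows as true labels and columns as predicted labels, then derives overall accuracy and per-class precision. The labels are toy values rather than results from the paper, and normalizing each row to percentages is an assumption about how the cells in Figure 8 are scaled.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[t, p] counts samples with true class t and predicted class p."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)    # accumulate one count per sample
    return cm

# Toy labels: 0 = normal, 1 = inner race, 2 = outer race (illustrative only).
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
cm = confusion_matrix(y_true, y_pred, 3)

accuracy = np.trace(cm) / cm.sum()                       # correct / total
row_percent = 100 * cm / cm.sum(axis=1, keepdims=True)   # share of each true class
precision = np.diag(cm) / cm.sum(axis=0)                 # per predicted class
print(cm, accuracy, precision)
```

Precision is computed column-wise, so a class can have high precision even when many of its true samples are missed, which is why it is worth reading alongside the row percentages.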

Conclusions
This paper proposes a method based on a DANN and UPS for the fault diagnosis of machinery bearings with imbalanced data. The model combines the advantages of semi-supervised learning and transfer learning so that they reinforce each other. Uncertainty-aware pseudo-label selection is used to balance the labeled and unlabeled data. A domain-adversarial neural network complements the target domain by transferring knowledge from the source domain. To demonstrate the efficacy of the proposed method, transfer learning experiments on two different datasets were performed. Compared with the independent approaches, including a DANN, DAAN, and UPS, the proposed method achieved superior results. Some further research directions can be pursued in the future: (1) applying heterogeneous transfer learning to predict target domain labels that never appeared in the source domain, and (2) reducing the proportion of labeled data and repeatedly testing the robustness of the model.