Collaborative Optimization of CNN and GAN for Bearing Fault Diagnosis under Unbalanced Datasets

Convolutional Neural Networks (CNNs) have been widely used in bearing fault diagnosis in recent years, and many satisfactory results have been reported. However, when the training dataset is unbalanced, for example when the samples under some fault labels are very limited, the CNN's performance inevitably degrades. To solve the dataset imbalance problem, a Generative Adversarial Network (GAN) is often adopted for data generation. In published studies, the GAN focuses only on the overall similarity of the generated data to the original measurement. The similarity in the fault characteristics, which carries more information for fault diagnosis, is ignored. To bridge this gap, this paper proposes two modifications to the general GAN. Firstly, a CNN and a GAN are optimized collaboratively. The GAN provides a more balanced dataset for the CNN, and the CNN outputs the fault diagnosis result as a correction term in the GAN generator's loss function to improve the GAN's performance. Secondly, the similarity of the envelope spectrum between the generated data and the original measurement is considered. The envelope spectrum error from the 1st to 5th order of the Fault Characteristic Frequencies (FCF) is taken as another correction term in the GAN generator's loss function. Experimental results show that the bearing fault samples generated by the optimized GAN contain more fault information than the samples produced by the general GAN. Furthermore, after data augmentation of the unbalanced training sets, the CNN's fault classification accuracy is significantly improved.


Introduction
Bearings are an indispensable component of rotating machines, and their health status directly affects or even determines the service life of the equipment. However, in practice, a bearing usually works under extreme and harsh conditions, which makes it prone to faults [1]. Therefore, timely and accurate fault diagnosis is crucial to reduce maintenance costs and avoid serious accidents.
In recent years, data-driven fault diagnosis has been attracting more and more attention from both academia and industry. Among the various data-driven methods, the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) are the most widely used due to their powerful abilities in complex feature extraction and nonlinear mapping. The CNN was first employed in bearing fault diagnosis by O. Janssens in 2016 [2], and, since then, many improvements have been proposed to enhance the CNN's performance, such as 1D-CNN, 2D-CNN, multiscale CNN, and adaptive CNN [3][4][5][6]. Russell Sabir adopted LSTM for bearing fault diagnosis based on the motor current signal and obtained a classification accuracy of 96% [7]. L. Yu and D. Qiu proposed the stacked LSTM and the bidirectional LSTM, respectively, and both LSTMs obtained an accuracy of more than 99% on bearing fault diagnosis [8,9]. H. Pan combined 1D-CNN and 1D-LSTM into a unified structure by feeding the CNN's output into the LSTM, achieving a test accuracy of up to 99.6% [10].
Although many sound results have been reported in deep learning-based fault diagnosis, many challenges remain. For example, all the studies mentioned above assume that plenty of high-quality data is available for training the deep network. However, in many applications, the available historical or experimental data is very limited, or the data provided is severely unbalanced, meaning that the sample size of some fault classes is far smaller than that of the others. Either insufficient or unbalanced data will seriously reduce the performance of deep networks. According to D. Xiao's work, when the training set was reduced from 1000 to 150 samples, the CNN's accuracy declined correspondingly from 97.2% to 83.9% [11]. When the imbalance ratio increased from 2:1 to 40:1, the classification accuracy for the outer ring fault based on the GAN-SAE dropped sharply from 97.79% to 20.95% [12].
To address this problem, scholars have proposed diverse methods. Oversampling was first proposed to solve the data imbalance, where direct replication was used to generate more samples for labels with very few samples [13,14]. Although this method is simple and efficient, it easily causes overfitting since no new information is incorporated. As another promising method for data generation, the GAN has already been used for new sample generation in fault diagnosis. Both W. Zhang and S. Shao employed a GAN to learn the mapping between the noise distribution and the actual machinery vibration data in order to expand the available dataset. The results confirmed that the diagnosis accuracy could be improved once the imbalanced data was augmented by the GAN [15,16]. However, when building and evaluating the GAN, published studies focus only on the overall similarity between the generated data and the original data, which inevitably causes problems in data quality. A small loss value in the general GAN only means that the generated data has a high similarity to the original signal; it does not guarantee that the generated signal has captured the important characteristics of the original signal. When generating more samples for unbalanced datasets in fault diagnosis, it is important to ensure that the generated sample carries the same or nearly the same fault information as the original one, which includes both time and frequency domain characteristics. For this reason, an improved GAN is proposed in this paper and applied to generate samples for an unbalanced experimental dataset, which is then used in CNN-based fault diagnosis.
The main innovations of this paper include: (1) A GAN and a CNN are optimized in cooperation. The GAN generates a more balanced dataset for the CNN, and the CNN evaluates the quality of the GAN's data generation. Each network contributes to the performance improvement of the other. (2) The fault characterization information is used to improve the general entropy-based loss function in the GAN. The amplitude and frequency errors in the envelope spectrum between the experimental and generated samples are taken as a correction term in the GAN's loss function, enabling the GAN to produce samples with higher fidelity that carry more fault information.
The remaining part of this paper is organized as follows. Section 2 details the theory and methodology of the GAN, CNN, and loss function improvement. Section 3 describes the test bench and experimental dataset. Section 4 discusses and analyzes the results. Section 5 concludes the whole paper.

Theory of the GAN
A GAN generates new data without any prior knowledge of the probability density function of the original data. It mainly consists of a generator and a discriminator. The discriminator determines whether a sample comes from the original or the generated dataset. In contrast, the generator tries to produce data so similar to the original that the discriminator can hardly make the right decision. In the general GAN, the loss functions of the generator and discriminator, $L_G$ and $L_D$, are defined as Equations (1) and (2), respectively [15], where $J$ is the number of real samples and $K$ is the number of generated samples. $x^{m}_{real}$ represents the data samples coming from the real training dataset, and $x^{i}_{fake}$ denotes the data samples from the GAN generator. $D(x^{m}_{real})$ designates the output of the discriminator $D$ for the input data sample $x^{m}_{real}$. Based on the loss functions $L_G$ and $L_D$, the GAN can be trained as a minimax two-player game until the global optimum, $D(x_{real}) = D(x_{fake}) = 0.5$, is reached. This indicates that the data from the generator is so similar to the real data that the discriminator cannot tell the difference.
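As a rough illustration of the adversarial objective described above, the following Python sketch assumes the widely used binary cross-entropy form of the two losses, with a non-saturating generator loss. This is an assumption for illustration and is not necessarily the exact form of Equations (1) and (2) in [15]; the function names are hypothetical.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Assumed form: -mean(log D(x_real)) - mean(log(1 - D(x_fake)))."""
    j, k = len(d_real), len(d_fake)  # J real samples, K generated samples
    real_term = -sum(math.log(p) for p in d_real) / j
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / k
    return real_term + fake_term

def generator_loss(d_fake):
    """Assumed non-saturating form: -mean(log D(x_fake))."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# At the optimum described in the text, D outputs 0.5 for every sample,
# so the losses reduce to 2*ln(2) and ln(2), respectively.
d_real = [0.5] * 4
d_fake = [0.5] * 6
print(discriminator_loss(d_real, d_fake))
print(generator_loss(d_fake))
```

Under this assumed form, a discriminator output of 0.5 on both real and generated samples means the discriminator is no better than guessing, which is the stopping condition named in the text.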

Fault Data Generation Based on GAN and CNN
The direct task of a GAN is to generate more samples for the labels with limited measurements. However, the ultimate goal is to improve the performance of the data-driven fault-diagnosis method when it deals with imbalanced datasets. Therefore, it is reasonable to take the final fault-diagnosis results into consideration when constructing a GAN so that the generated data can indeed sharpen the algorithm's fault-diagnosis ability. In this paper, a CNN is selected as a representative of the data-driven fault-diagnosis methods, and the diagnosis task is fault classification, so its performance is evaluated by the cross-entropy $L_{CNN}$, as shown in Equation (3). The CNN's result is introduced as a correction term in the GAN's generator loss function as formulated in Equation (4), where $N$ is the number of bearing fault types. $x_i = 1$ if the input sample belongs to bearing fault type $i$; otherwise, $x_i = 0$. $p_i$ is the output of the softmax function, which represents the probability that the input data belongs to bearing fault type $i$. The formulation for $p_i$ is given in Equation (5), and it satisfies $\sum_{i=1}^{N} p_i = 1$ [17]. $\beta$ is a scale factor that keeps the loss functions of the GAN and CNN in the same range.
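The cross-entropy and softmax quantities described above can be sketched in Python as follows. The softmax and cross-entropy are the standard definitions; the additive combination `l_g + beta * l_cnn` is only an assumed form of the corrected generator loss, since the exact form of Equation (4) is not restated here, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Standard softmax; subtracting the maximum avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(one_hot, probs, eps=1e-12):
    """L_CNN = -sum_i x_i * log(p_i) for a one-hot label x."""
    return -sum(x * math.log(p + eps) for x, p in zip(one_hot, probs))

def corrected_generator_loss(l_g, l_cnn, beta):
    """Assumed additive correction: L_G + beta * L_CNN."""
    return l_g + beta * l_cnn

# Three fault types (N = 3); the sample belongs to the second type.
probs = softmax([1.0, 2.0, 0.5])
l_cnn = cross_entropy([0, 1, 0], probs)
print(sum(probs))  # the probabilities sum to 1
print(corrected_generator_loss(l_g=0.7, l_cnn=l_cnn, beta=0.5))
```

The scale factor `beta` plays the role described in the text: it keeps the classification term from dominating or vanishing next to the adversarial term.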

Improvement of Loss Function with Envelope Spectrum
The general GAN can produce data with high similarity to the original measurement, as stated in the last sub-section. In theory, the data fidelity can be improved further when a CNN is employed to collaboratively optimize a GAN. Up to this point, however, all the data points in a sample are treated equally, and the GAN's target is to keep the generated data as similar to the original as possible. In fault diagnosis, some data points contain more information than others. For example, once a fault occurs on a certain component, such as the outer ring, the inner ring, or the balls, the corresponding fault characteristic frequencies (FCF) will appear in the acceleration spectrum. Compared with the overall similarity, the frequency and amplitude at the fault characteristic frequencies contain much more information about the bearing health condition. Therefore, the error of amplitudes and frequencies between the original signal and the generated one at the fault characteristic frequencies is defined as another correction term, $L_{frequency}$, in the frequency domain, as given in Equation (6), where $N$ denotes the maximum order of FCF, and $N = 5$ in this study. $M^{i}_{real}$ and $M^{i}_{fake}$ stand for the $i$-th order FCF amplitude from the real and generated sample. $F^{i}_{real}$ and $F^{i}_{fake}$ represent the $i$-th order FCF frequency from the real and generated sample. In addition, because frequency and amplitude have different value ranges, the widely used normalization method MinMaxScaler [18] is applied in this study to normalize the amplitudes and frequencies within the 5th-order FCF to the range of [0, 1].
Finally, $L_{frequency}$ is combined with $L_{CNN}$ to construct the final loss function of the GAN's generator. As shown in Equation (7), the sum of $L_{CNN}$ and $L_{frequency}$ is taken as a modification term in the general GAN's loss function $L_G$ to ensure that the data generated by the GAN has a high similarity and, at the same time, captures the important information in detail. $\alpha$ is a weight factor.
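The structure of this combined loss can be sketched as follows. The sketch rests on several assumptions that are not confirmed by the text: the frequency term is taken as the mean absolute error of min-max normalized amplitudes and frequencies, the normalization range is taken from the real sample, and the final loss is taken as $L_G + \alpha (L_{CNN} + L_{frequency})$. All function names are hypothetical.

```python
def min_max_scale(values, lo, hi):
    """Map values to [0, 1] using a given range (MinMaxScaler-style)."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

def frequency_loss(m_real, m_fake, f_real, f_fake):
    """Assumed form: mean absolute error of normalized FCF amplitudes
    and frequencies over the N orders (N = 5 in the study)."""
    a_lo, a_hi = min(m_real), max(m_real)  # range taken from the real sample
    f_lo, f_hi = min(f_real), max(f_real)
    amp_err = sum(abs(r - g) for r, g in zip(
        min_max_scale(m_real, a_lo, a_hi), min_max_scale(m_fake, a_lo, a_hi)))
    freq_err = sum(abs(r - g) for r, g in zip(
        min_max_scale(f_real, f_lo, f_hi), min_max_scale(f_fake, f_lo, f_hi)))
    return (amp_err + freq_err) / len(m_real)

def final_generator_loss(l_g, l_cnn, l_freq, alpha):
    """Assumed form: L_G + alpha * (L_CNN + L_frequency)."""
    return l_g + alpha * (l_cnn + l_freq)

# Illustrative numbers only, not taken from the paper's tables.
m_real, m_fake = [0.0, 1.0], [0.0, 0.5]
f_real, f_fake = [100.0, 200.0], [100.0, 200.0]
l_freq = frequency_loss(m_real, m_fake, f_real, f_fake)
print(l_freq, final_generator_loss(0.7, 0.2, l_freq, alpha=1.0))
```

Whatever the exact form, the design idea is the same: a generated sample whose FCF peaks sit at the wrong frequency or have the wrong height is penalized even if its overall waveform looks similar to the original.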
To obtain $L_{frequency}$, the first step is to calculate the theoretical FCF. The XJTU-SY dataset [19] introduced in the following section includes only three kinds of faults, namely the outer race fault, the inner race fault, and the cage fault. The theoretical FCFs for these 3 fault types are the BPFO (Ball Passing Frequency on Outer race), BPFI (Ball Passing Frequency on Inner race), and FTF (Fundamental Train Frequency), respectively. Their formulations are listed as follows [20]:
$$\mathrm{BPFO} = \frac{n}{2} f_s \left(1 - \frac{d}{D}\cos\alpha\right)$$
$$\mathrm{BPFI} = \frac{n}{2} f_s \left(1 + \frac{d}{D}\cos\alpha\right)$$
$$\mathrm{FTF} = \frac{1}{2} f_s \left(1 - \frac{d}{D}\cos\alpha\right)$$
where $n$ is the number of rolling elements, and $f_s$ is the shaft frequency. $d$ represents the ball diameter, and $D$ denotes the pitch diameter. $\alpha$ is the bearing contact angle. After calculation of the theoretical FCF, the second step is to capture the actual FCF around the corresponding theoretical values. The actual FCF can be affected by many factors, such as the shaft speed, external load, friction coefficient, raceway groove curvature, and the defect size [21,22]. Therefore, there is a bias between the theoretical FCF and the actual FCF in most cases. In addition, some harmonics of the FCF that are influenced by modulation from other vibrations may not be detected on the test bench [22]. Thus, in this paper, the $i$-th order actual FCF is determined as the maximum peak in the interval of [0.95, where $FCF_{1st}$ is the first order theoretical FCF, and $i$ is the current frequency order. The actual FCFs of both the real measurement sample and the generated sample are determined by the above two steps. Once the actual FCF is identified, $L_{frequency}$ can be obtained by Equation (6).
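The two steps above can be sketched in Python as follows. The theoretical frequencies use the standard kinematic formulas for BPFO, BPFI, and FTF. The search window is assumed here to be symmetric, from 0.95 to 1.05 times the $i$-th multiple of the first-order theoretical FCF; only the lower factor 0.95 is stated in the text, so the upper factor is an assumption. The bearing parameters and the spectrum in the usage example are illustrative placeholders, and the function names are hypothetical.

```python
import math

def theoretical_fcfs(n, f_s, d, D, alpha_deg=0.0):
    """Standard kinematic formulas for BPFO, BPFI, and FTF (in Hz)."""
    ratio = (d / D) * math.cos(math.radians(alpha_deg))
    return {
        "BPFO": 0.5 * n * f_s * (1.0 - ratio),
        "BPFI": 0.5 * n * f_s * (1.0 + ratio),
        "FTF": 0.5 * f_s * (1.0 - ratio),
    }

def find_actual_fcf(freqs, amps, fcf_1st, order, lower=0.95, upper=1.05):
    """Return (frequency, amplitude) of the largest peak in the window
    [lower, upper] * order * fcf_1st; the upper factor is an assumption."""
    lo, hi = lower * order * fcf_1st, upper * order * fcf_1st
    window = [(a, f) for f, a in zip(freqs, amps) if lo <= f <= hi]
    if not window:
        return None  # harmonic not detectable in this spectrum
    amp, freq = max(window)
    return freq, amp

# Placeholder bearing geometry and shaft frequency for illustration.
fcfs = theoretical_fcfs(n=8, f_s=37.5, d=7.92, D=34.55, alpha_deg=0.0)
print(fcfs)

# Toy envelope spectrum: the largest peak inside the window is picked.
freqs = [100.0, 110.0, 115.0, 120.0, 130.0]
amps = [0.1, 0.2, 0.9, 0.3, 0.8]
print(find_actual_fcf(freqs, amps, fcf_1st=115.0, order=1))
```

Returning `None` for an empty window reflects the remark in the text that some harmonics may not be detectable.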

Collaborative Training Mechanism of the GAN and CNN
Once the modification of the GAN loss function has been determined, the next step is to train the GAN in cooperation with the CNN. The collaborative training process is shown in Figure 1. In general, the GAN provides a more balanced dataset for the CNN to improve its fault diagnosis accuracy, whereas the CNN evaluates the GAN's generated dataset and outputs its fault classification result as a correction term in the generator's loss function to improve the GAN's data-generation quality. Under this collaborative training structure, the performance of both the CNN and the GAN can be enhanced. Specifically, as shown in Figure 1, the CNN is first built on the unbalanced dataset, and its classification error is expected to be high. Meanwhile, the discriminator and the generator of the GAN are established. Initially, the generator does not work well, and the generated samples are not very similar to the original ones. The next step is to optimize the CNN and GAN collaboratively. During the optimization process, the GAN's generator learns to generate samples similar to the original signal. The newly generated samples are immediately added to the training dataset of the CNN so that the dataset imbalance is reduced. The optimization process stops when the Nash equilibrium, defined as $D(x_{real}) = D(x_{fake}) = 0.5$, is reached. Lastly, the GAN's generator is used to extend the original dataset, and the CNN is fine-tuned with the extended dataset. The architecture of the GAN proposed in this paper is detailed in Figure 2. Tables 1 and 2 summarize the hyperparameters of the GAN and CNN, respectively.

Introduction of Bearing Test Bench and Dataset
The experimental data for validation comes from the Xi'an Jiaotong University (XJTU-SY) bearing test bench [19]. As shown in Figure 3, the bearing accelerated life test bench consists of an alternating current induction motor, motor speed controller, supporting shaft, supporting bearing, hydraulic loading system, and test bearing. The test bearing type is LDK UER204, and its basic parameters are summarized in Table 3. The bearing works under 3 different conditions, as specified in the first column of Table 4, where $f_s$ stands for the shaft frequency, and $F_r$ for the radial loading force. Both the axial and radial accelerations are measured at a sampling frequency of 25.6 kHz. The sampling interval between any two measurements is 1 min, and each sampling lasts for 1.28 s. Under each condition, 5 bearings are tested, for example bearings 1_1 to 1_5 under condition 1. As each test bearing has a different lifetime, the measurement sample size varies from one test bearing to another. Due to the inherent micro-anisotropy and the different working conditions, the lifetime and failure location differ between the test bearings. For a single fault, there are 3 fault types in total, namely the outer race fault, the inner race fault, and the cage fault. Moreover, two datasets, bearing 1_5 and bearing 3_2, contain measurements of a compound fault. To simplify the labeling process, only single faults are considered in this paper. As summarized in Table 4, the total number of samples is large enough for CNN training. However, the dataset is extremely unbalanced. For most test bearings under all 3 conditions, the failure occurs on the outer ring, with very limited samples on the inner ring and the cage.

Data Preprocessing
The XJTU-SY bearing dataset records the bearing acceleration during the whole life cycle. The test bench runs continuously until the acceleration amplitude exceeds $10 \times A_{normal}$, which is defined as the failure point. Here, $A_{normal}$ is the maximum amplitude of the horizontal or vertical vibration signals when the bearing runs in the normal operating stage. The fault location in Table 4 is the position where the fault occurs when the bearing finally fails. In order to extract sufficient measurement data for fault classification while maintaining the correct labels, the signals with acceleration amplitude between $2 \times A_{normal}$ and $10 \times A_{normal}$ are regarded as the fault signals, as shown in the accompanying figure. After the valid source data and labels are prepared, the next step is data preprocessing. First, the original measurement is denoised by 3-level wavelet decomposition, with Symlet4 as the mother wavelet. After the noise in the high-frequency components is removed, the data is normalized by z-score. Finally, the normalized data is transformed from 1D to 2D, which means that the acceleration series is sliced into fragments of the same length, which are then stacked row by row to build a matrix, as illustrated in Figure 5. Each sample contains a total of 32,768 data points. Therefore, the size of the 2D matrix is set to 181 × 181 (181 × 181 = 32,761, so the matrix holds 32,761 of the 32,768 points), and the reshaped 2D matrix is fed into the GAN and CNN as an image. All the work in this study is conducted in MATLAB Deep Network Designer.
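The normalization and 1D-to-2D reshaping steps can be sketched in Python as follows. The study itself was carried out in MATLAB, so this is only an illustrative equivalent. The wavelet denoising step is omitted, and dropping the trailing points that do not fit into the square matrix is an assumption, since the text does not state which 7 points are discarded. The function names are hypothetical.

```python
import math

def z_score(signal):
    """Normalize to zero mean and unit (population) standard deviation."""
    mean = sum(signal) / len(signal)
    std = math.sqrt(sum((v - mean) ** 2 for v in signal) / len(signal))
    return [(v - mean) / std for v in signal]

def to_square_image(signal):
    """Slice a 1D series into equal-length rows and stack them.
    Assumption: trailing points beyond side*side are dropped."""
    side = math.isqrt(len(signal))  # 181 for 32,768 points
    trimmed = signal[: side * side]
    return [trimmed[r * side:(r + 1) * side] for r in range(side)]

# Synthetic stand-in for one 1.28 s sample at 25.6 kHz (32,768 points).
sample = [math.sin(0.01 * k) for k in range(32768)]
image = to_square_image(z_score(sample))
print(len(image), len(image[0]))  # 181 181
```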

Fault Data Generation Based on Optimized GAN
According to Table 5, there are significantly more samples for the outer race fault than for the inner race fault and the cage fault. Consequently, generating more samples for the inner race fault and the cage fault is essential to reduce the dataset imbalance. It should be noted that the inner race fault samples consist of data from bearing 2_1, bearing 3_3, and bearing 3_4, while the cage fault samples consist of data from bearing 1_4 and bearing 2_3. This means that both the inner race and cage faults have measurement samples collected under different working conditions, which define different data distributions. Furthermore, each test bearing has totally different aging dynamics, which can be deduced from their full life cycle trajectories [19]. As a result, the GANs for these datasets need to be trained individually. Bearing 1_4 has only one sample and is, hence, not usable for fault diagnosis. In total, 4 GANs need to be established, for bearing 2_1, bearing 2_3, bearing 3_3, and bearing 3_4. The data samples generated by a general GAN and an optimized GAN are illustrated in Figure 6 and compared with the original ones after normalization. Specifically, Figure 6(a1) shows the original signal of a measurement sample from bearing 2_1, Figure 6(a2) is the corresponding sample generated by the general GAN, and Figure 6(a3) shows the sample generated by the optimized GAN. Likewise, Figure 6(b1-b3) shows the results for bearing 2_3, and Figure 6(c1-c3) for bearing 3_3. Taking the inner race fault bearing 2_1 as an example, both GANs produce samples with high similarity to the original measurement in the time domain, and even the peaks are accurately rebuilt. It can be further noticed that the optimized GAN generates a much more accurate peak amplitude than the general GAN. In order to evaluate the GAN's data-generation quality in the time domain, every sample is regarded as a vector $x$ ($x \in \mathbb{R}^{D}$), and every sampling point $x_i$ as an element of the vector.
The similarity between the generated sample and the original one can be measured by the angle between the two corresponding vectors. Therefore, cosine similarity is adopted as a time domain similarity metric, which is defined as follows:
$$\cos\theta = \frac{m \cdot n}{|m|\,|n|}$$
where $m$ and $n$ stand for the acceleration series from the original measurement and the generated sample, respectively, each consisting of $L$ sampling points $\{x_1, x_2, \cdots, x_L\}$. $|m|$ and $|n|$ denote the 2-norm of $m$ and $n$, respectively. The cosine similarity results are summarized in Table 6. For all 3 cases, the sample generated by the optimized GAN has a higher cosine similarity to the original one than that produced by the general GAN, which supports the superiority of the optimized GAN in high-quality data generation. The relatively small cosine similarity values can be explained by the fact that the acceleration values vary within a wide range of [−5, 5] and the signal length is up to 32,761, which means that any difference in acceleration amplitude or direction, or any time lag between counterpart points, accumulates into a large deviation. In addition, treating the acceleration signal as a 1D vector may not be appropriate when it contains so many elements, which needs further exploration in the future, such as using the Fréchet distance to replace the cosine similarity [23]. Apart from the overall similarity in the time domain, the signal characteristics in the frequency domain are equally or even more important for fault diagnosis. In this study, the envelope spectrum is computed for the original and generated samples. As only the 1st to 5th FCFs are considered in this study, the signal is first filtered by a low-pass filter of 1000 Hz, and then the envelope spectrum is extracted by the Hilbert transform and the Fast Fourier Transform. The results are displayed in Figures 7-9.
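The cosine similarity used above as the time domain metric can be sketched in Python as follows. The second part of the example uses a synthetic sine wave, not data from the paper, to illustrate why a time lag between otherwise identical signals lowers the metric sharply. The function name is hypothetical.

```python
import math

def cosine_similarity(m, n):
    """cos(theta) = (m . n) / (|m| * |n|) for two equal-length series."""
    dot = sum(a * b for a, b in zip(m, n))
    norm_m = math.sqrt(sum(a * a for a in m))
    norm_n = math.sqrt(sum(b * b for b in n))
    return dot / (norm_m * norm_n)

# Identical signals give a similarity of 1.
wave = [math.sin(2 * math.pi * k / 100) for k in range(400)]
print(cosine_similarity(wave, wave))

# The same wave shifted by a quarter period gives a similarity near 0,
# although the two signals have the same shape and amplitude.
shifted = [math.cos(2 * math.pi * k / 100) for k in range(400)]
print(cosine_similarity(wave, shifted))
```

This sensitivity to alignment is one reason a point-by-point metric can understate the quality of a generated vibration signal.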
Take Figure 7 as an example, which gives the envelope spectrum of bearing 2_1, where the black line is the result of the original measurement, the blue line is the sample generated by the general GAN, and the red line is the sample from the optimized GAN. The theoretical BPFI is indicated by the green dashed line. The envelope spectrum of the samples generated by the optimized GAN is similar to the original one, while that of the samples generated by the general GAN is clearly different, especially in the amplitudes at the actual fault characteristic frequencies. The two locally enlarged views in Figure 7 show that the amplitude of the sample generated by the optimized GAN is much closer to that of the original sample than the amplitude of the sample from the general GAN. The same is observed for the inner race fault (bearing 3_3), as well as the cage fault (bearing 2_3), which confirms that the optimized GAN helps the generated signals capture more accurate fault characteristics in the frequency domain. As for the other peaks besides the fault characteristic ones, especially for the inner race fault, most of them are caused by modulation from the shaft frequency and its harmonics, which is consistent with previous research [24]. Additionally, the deviation between the actual FCFs and the corresponding theoretical values can be explained by many factors, such as the frequency resolution of 0.7814 Hz, the occurrence of rolling element sliding, and the transient contact angles under high external load. Tables 7-9 summarize the sample frequencies and amplitudes at the corresponding FCF and harmonics, as well as the relative error percentage of these two features between the generated and original samples.
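The envelope spectra compared above are obtained with the Hilbert transform followed by the Fast Fourier Transform. A minimal standard-library Python sketch of that step is given below. It is not the authors' MATLAB implementation: it omits the 1000 Hz low-pass filter, uses a textbook recursive radix-2 FFT that requires a power-of-two length (the raw sample length of 32,768 satisfies this), and is demonstrated on a synthetic amplitude-modulated signal. All names are hypothetical.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out

def ifft(spectrum):
    """Inverse FFT via the conjugate trick."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]

def envelope_spectrum(signal, fs):
    """Return (freqs, amps) of the envelope spectrum of a real signal."""
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    # Analytic signal: keep DC and Nyquist, double positive frequencies,
    # zero negative frequencies (frequency-domain Hilbert transform).
    weights = [1.0] + [2.0] * (n // 2 - 1) + [1.0] + [0.0] * (n // 2 - 1)
    analytic = ifft([s * w for s, w in zip(fft(signal), weights)])
    envelope = [abs(v) for v in analytic]
    mean = sum(envelope) / n
    env_fft = fft([e - mean for e in envelope])  # remove the DC offset
    freqs = [k * fs / n for k in range(n // 2)]
    amps = [2.0 * abs(env_fft[k]) / n for k in range(n // 2)]
    return freqs, amps

# Synthetic test: a 200 Hz carrier modulated at 16 Hz with depth 0.5.
fs, n = 1024, 1024
sig = [(1 + 0.5 * math.cos(2 * math.pi * 16 * k / fs))
       * math.cos(2 * math.pi * 200 * k / fs) for k in range(n)]
freqs, amps = envelope_spectrum(sig, fs)
peak = max(range(len(amps)), key=amps.__getitem__)
print(freqs[peak], amps[peak])  # peak at the 16 Hz modulation frequency
```

In a bearing signal, the modulation frequency recovered this way corresponds to the fault characteristic frequency, which is why the comparison is made on the envelope spectrum rather than on the raw spectrum.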
The comparison in Table 7 shows that, for all the 1st-5th order BPFIs, the frequencies and amplitudes of the samples generated by the optimized GAN are much closer to the original ones than those of the samples produced by the general GAN. For the sample generated by the optimized GAN, the frequency error percentage at all five orders of BPFI is zero, while the sample generated by the general GAN does not fully capture the actual BPFI of the original sample, although its deviation is only 0.34% and appears only at the 5th order BPFI. The advantage of the optimized GAN is much clearer in the amplitudes at the BPFI. The amplitude errors at all 5 orders of BPFI for the samples generated by the optimized GAN are much smaller than those for the general GAN. Take the 2nd BPFI as an example: the actual amplitude of the original sample is 0.062, while the corresponding amplitudes of the samples from the general GAN and the optimized GAN are 0.023 and 0.047, respectively. The relative error percentage of the amplitude drops from 62.0% to 23.8%. The above analysis confirms that the modification term $L_{frequency}$ in the GAN's generator loss function enables the GAN to capture the fault information in the frequency domain. The same conclusion can also be drawn from the results in Tables 8 and 9. In summary, the data generation results show that both the general GAN and the optimized GAN can generate samples similar to the original ones. However, the samples generated by the optimized GAN have a higher similarity to the original ones than those generated by the general GAN, especially at the FCF and its harmonics in the frequency domain. More specifically, data generation for one fault type under different working conditions, such as bearing 2_1 and bearing 3_3, shows that the optimized GAN method can be applied to bearings under different working conditions. Furthermore, the results for bearing 2_1 (inner race fault) and bearing 2_3 (cage fault) demonstrate that the optimized GAN method adapts to bearings with different defect types.

Fault Diagnosis Based on CNN_GAN
As introduced in Section 3, there are 648 outer race fault samples, 138 inner race fault samples, and 209 cage fault samples. In other words, the imbalance ratio of the XJTU-SY bearing dataset is nearly 5:1:1.5 (outer race fault samples : inner race fault samples : cage fault samples). Of these samples, 80% are assigned to the training set, and the remaining 20% form the test set. To fully evaluate the positive effect that the GAN has on the CNN when dealing with unbalanced datasets, two more training sets with imbalance ratios of 10:1:2 and 20:1:2 are built by randomly selecting fewer inner race fault and cage fault samples from the XJTU-SY bearing dataset (the training dataset in Table 5), while the test set is kept the same as the test set in Table 5. The sample composition of the three training sets with different imbalance ratios and of the test set is illustrated in Figure 10. Two training strategies are compared on the test set. On the one hand, the CNN is trained on the training sets with the different imbalance ratios, in which the outer race fault has many more samples than the inner race fault and the cage fault. On the other hand, the unbalanced training sets are extended with the optimized GAN by generating more inner race fault and cage fault samples. After data generation, all 3 fault types in the extended training sets have the same sample size of 518 samples each. In other words, the ratios between the outer race fault samples, the inner race fault samples, and the cage fault samples become balanced. The general CNN and the CNN_GAN mentioned above are validated with the same testing set. The difference between these two CNNs is that the former is trained with the imbalanced training set and then directly validated with the testing set, while the latter is trained with the extended dataset that has been balanced through the collaboration of the GAN and CNN and then validated with the testing set.
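The arithmetic behind these figures can be checked with a short Python sketch. It uses the sample counts stated above; rounding the 80% training share down to a whole number is an assumption, chosen because it reproduces the 518 outer race training samples mentioned in the text. The variable names are hypothetical.

```python
# Sample counts per fault type as stated in the text.
counts = {"outer": 648, "inner": 138, "cage": 209}

# Imbalance ratio relative to the smallest class (inner race fault).
smallest = min(counts.values())
ratio = {k: round(v / smallest, 2) for k, v in counts.items()}
print(ratio)  # {'outer': 4.7, 'inner': 1.0, 'cage': 1.51}

# 80% training share, rounded down (assumption), using integer arithmetic.
train = {k: v * 8 // 10 for k, v in counts.items()}
print(train)  # {'outer': 518, 'inner': 110, 'cage': 167}

# Number of samples the GAN must generate so that every class
# reaches the size of the largest class in the training set.
target = max(train.values())
to_generate = {k: target - v for k, v in train.items()}
print(to_generate)  # {'outer': 0, 'inner': 408, 'cage': 351}
```

Under this assumption, generated samples make up roughly four fifths of the inner race class in the balanced set, which underlines why the fidelity of the generated data matters for the classifier.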
The CNNs' performance comparison on the testing set is displayed in Table 10. For the general CNN, the fault diagnosis accuracy decreases from 98% to 88% when the imbalance ratio of the training set increases from 5:1:1.5 to 10:1:2, and it drops sharply to 68% when the imbalance ratio further rises to 20:1:2. This confirms that the imbalance ratio of the training dataset has a great influence on the CNN's performance. In contrast, if a CNN is trained on the extended datasets that have been augmented with the samples generated by the optimized GAN, the CNN's performance can be significantly improved. For instance, when the CNN_GAN is trained with training sets 1 and 2 after they have been extended and balanced, its fault classification accuracy on the testing set reaches 100% and 90%, respectively. Even when the imbalance ratio rises to 20:1:2, the CNN_GAN's fault classification accuracy remains at 88%. Under all 3 training sets, the CNN_GAN has a smaller average cross-entropy error than the general CNN, which shows that the GAN can efficiently improve the CNN's fault diagnosis performance by generating new samples when dealing with unbalanced datasets. Additionally, Table 10 shows that a training set with a higher imbalance ratio leads to lower CNN classification accuracy, even after being balanced by data generation with a GAN. Although the CNN_GAN performs better than the CNN, the accuracy of both networks follows the same downward trend as the imbalance ratio increases, which indicates that there is a limit to the imbalance ratio of the training set that the CNN_GAN can handle for a predefined CNN performance index. For example, in this case, if the target classification accuracy is set to 90%, then the CNN_GAN can deal with a training set with a maximum imbalance ratio of 10:1:2.
Besides the accuracy and cross-entropy, the confusion matrix gives more details of the classification for each label. As presented in Table 11, all 3 cases are validated on the same testing set shown in Figure 10 but trained with one of the three training sets with different imbalance ratios in Figure 10. Specifically, the general CNN is trained with the original unbalanced datasets, and the CNN_GAN is trained with the extended datasets that have been balanced by the optimized GAN. In these confusion matrices, the misclassified samples mainly come from the inner race fault and the cage fault because the outer race fault samples are dominant in each training set. Moreover, the higher the imbalance ratio is, the higher the prediction error is. A further comparison between the CNN and the CNN_GAN shows that the CNN_GAN achieves higher overall accuracy than the general CNN. In addition, the classification accuracy for both the inner race fault and the cage fault can be improved if the optimized GAN is employed to generate the inner race and cage fault samples. For example, under set 1 and set 2, the classification accuracy on the inner race fault increases from 85.7% to 100% and from 14.3% to 28.6%, respectively. With respect to the cage fault, the diagnosis accuracy increases remarkably from 4.8% to 90.5% under set 3. The result can be explained as follows: in the unbalanced dataset, the samples of the dominant fault type have much more influence on the loss function, which pushes the CNN to extract local features that are shared only by the dominant fault type, at the expense of its ability to extract more general and robust features that can distinguish different fault types. This means that the CNN has fallen into overfitting. For the CNN_GAN, in contrast, the imbalanced data has been balanced, which means there is no dominant fault type in the training set. Therefore, the trained CNN_GAN can avoid overfitting and capture fault features that distinguish the fault types while remaining robust. Based on the above analysis, it can be concluded that a balanced training dataset can effectively enhance the CNN's fault classification performance, and the optimized GAN can efficiently transform an unbalanced dataset into a balanced one by generating samples for the fault types that have limited data.
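The per-class and overall accuracies discussed above follow directly from a confusion matrix. The Python sketch below shows the calculation on a hypothetical 3 × 3 matrix whose counts are made up for illustration and are not taken from Table 11. Rows are assumed to hold the true classes and columns the predicted classes, and the function names are hypothetical.

```python
def per_class_accuracy(cm):
    """Fraction of each true class (row) that is predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def overall_accuracy(cm):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

# Hypothetical counts; class order: outer race, inner race, cage.
cm = [
    [50, 0, 0],   # all outer race samples classified correctly
    [3, 7, 0],    # 3 inner race samples mistaken for outer race
    [1, 1, 18],   # 2 cage samples misclassified
]
print(per_class_accuracy(cm))  # [1.0, 0.7, 0.9]
print(overall_accuracy(cm))    # 0.9375
```

The example also shows why overall accuracy alone can hide poor minority-class performance: the dominant class contributes most of the correct predictions.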

Conclusions
To solve the CNN's performance reduction problem under unbalanced datasets, an improved GAN is proposed to generate new data for the fault classes with limited samples. The work can be summarized as follows:
• A collaborative network, CNN_GAN, is developed. The GAN generates an almost balanced dataset through data augmentation for the inner ring and cage fault samples. Once the generated samples are added, the CNN evaluates the quality of the extended dataset and outputs the fault classification result to modify the loss function of the GAN's generator.
• Besides the overall similarity, the similarity of the envelope spectrum is considered when building the GAN. The envelope spectrum error at the 1st-5th order FCF between the experimental data and the generated data is taken as a correction term in the general cross-entropy based loss function of the GAN's generator.
Experimental validation is carried out on the XJTU-SY bearing dataset. The results confirm the effectiveness of the optimized GAN and of the collaborative structure of the CNN_GAN. The main conclusions are as follows:
• When constructing the loss function for a GAN, the GAN's performance can be improved by considering the envelope spectrum error. The generated samples have higher fidelity and contain more accurate fault information, which, in turn, contributes to the CNN's accuracy improvement.
• The collaborative network CNN_GAN performs better than the GAN or the CNN alone. The GAN generates more accurate data if the CNN's classification results are incorporated into the GAN's loss function. The CNN's fault classification accuracy can be significantly enhanced after the GAN generates more data for the unbalanced training dataset.
Although the idea is validated only with the CNN_GAN in this paper, it can be extended to other methods. For example, the fault characteristic spectrum can be replaced by other metrics characterizing the bearing fault status. In future work, we will focus on extending this method and try to develop a physics-guided GAN. Validation with more experimental data and application cases will also be addressed in the future.
Data Availability Statement: The original experimental data can be downloaded from http://biaowang.tech/xjtu-sy-bearing-datasets, and the data samples generated by the optimized GAN for this study can be found in the following web-based repository: https://www.dropbox.com/sh/aqtzfb514x8hymd/AAB-8cayG5dDsn0z_FFuiNosa?dl=0.