Exploiting Generative Adversarial Networks as an Oversampling Method for Fault Diagnosis of an Industrial Robotic Manipulator

Abstract: Data-driven machine learning techniques play an important role in fault diagnosis, safety, and maintenance of the industrial robotic manipulator. However, these methods require data that, more often than not, are hard to obtain, especially data collected from fault condition states. Without enough appropriate (balanced) data, no acceptable performance should be expected.


Introduction
Data-driven machine learning techniques [1] play an important role in machinery fault diagnosis and prognostics [2,3]. Recently, deep learning [4] has been progressively adopted to develop health monitoring systems for machinery. One way of looking at deep learning is as a feature engineering method [5] that automatically extracts features from the collected signals. By propagating these signals from the input layer through layers with fewer and fewer neurons, the neural network is forced to represent the input data space in a lower-dimensional feature space, which, in general, reduces overfitting and increases accuracy. [...] that the conditional adversarial network is a promising approach for image-to-image translation tasks. Building on this observation, Zhu et al. [29] presented an approach for learning to translate an image from a source domain to a target domain in the absence of paired examples. Shao et al. [30] developed an auxiliary classifier GAN (ACGAN) to learn from mechanical sensor signals and generate realistic one-dimensional raw data. Li et al. [31] proposed a novel fault detection method for 3D printers using a GAN that considers only normal-condition signals, with outstanding performance. Mao et al. [32] used a GAN for unbalanced data-driven fault diagnosis of rolling bearings. Also for rolling bearings, Jiang et al. [33] proposed a novel anomaly detection approach based on a GAN using only healthy data. Li et al. [34] applied a GAN to feature space learning in fault diagnosis of 3D printers using only one sample of each faulty state. Wang et al. [35] proposed a method based on a conditional variational auto-encoder and a generative adversarial network for unbalanced fault diagnosis of a planetary gearbox.
To the best of our knowledge, no work has reported the model-based generation of synthetic samples for fault diagnosis of robotic manipulators. Hence, the main contributions of our work are: (1) the application of a GAN to generate synthetic examples (signals) representing fault states in order to mitigate the presence of an unbalanced data set in a fault diagnosis task of an industrial robotic manipulator. More concretely, the GAN generates synthetic wavelet packet transform based features of a vibration signal acquired by an accelerometer; (2) a comprehensive study taking into account six different scenarios for mitigating the unbalanced data, including classical undersampling and oversampling (SMOTE) methods, as well as an assessment of the effect of factors such as generator selection, the number of training examples in each class, shuffling of the training data, the distribution used for sampling the random input data, and initial conditions. The rest of this paper is organized as follows. The proposed GAN-based fault diagnosis scheme is specified in Section 2. The manipulator experiment is presented in Section 3. The fault diagnosis of the manipulator is analyzed in Section 4 to validate the experimental results. Finally, the conclusions and future work are detailed in Section 5.

Feature Extraction
Wavelet packet transform (WPT) can be viewed as a time-frequency conversion technique for non-stationary signals [36]. It overcomes a shortcoming of the wavelet transform, which decomposes only the low-frequency components and therefore cannot provide high resolution for the high-frequency components. The discrete wavelet transform of the discrete signal f(t) is given by [37]: [...] (1) where m is the scaling factor and n is the shifting factor, which are given, respectively, by: [...] When $a_0 = 2$ and $b_0 = 1$, (1) can be rewritten as [...] In the wavelet transform, the signal u(t) can be separated in the Hilbert space by a scaling function and a wavelet function. The scaling function Φ(t) corresponds to the low-frequency part of the original signal, while the wavelet function ϕ(k) corresponds to the high-frequency part of the original signal, with initial conditions: [...] Figure 1 shows a 3-level decomposition of a WPT of a signal u(t). The signal is decomposed into a high-frequency part $h_k(t)$ and a low-frequency part $g_k(t)$. Each part is computed by a filter, i.e., [...] The function using the above filters can be given by: [...] As illustrated in the figure, this procedure can be applied recursively to both the low- and high-frequency parts. However, the number of decomposition levels is limited by the actual application. Due to its smoothness and nonlinear characteristics, in this paper we applied the Daubechies (Db7) WPT with seven decomposition levels.
We further compute an informative statistic from the WPT as follows [38]: [...] $(w_{m,n}(t))^2$ (11) where N denotes the number of data points in each node of the 7th decomposition level and $w_{m,n}(t)$ is given by (1). Hence, a feature vector p can be defined for the signal u(t) as follows: [...] (12) where d is the number of features, i.e., with seven levels, $d = 2^7 = 128$.
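To make the feature construction concrete, the following is a minimal sketch of a wavelet packet energy feature vector. It is not the authors' implementation: it swaps the Db7 wavelet for the much simpler Haar wavelet so that it runs with the Python standard library only, and it assumes that the statistic in (11) is the mean of the squared coefficients of each terminal node. The function names `haar_split` and `wpt_energy_features` are hypothetical.

```python
import math


def haar_split(x):
    """One Haar analysis step: return the (low-pass, high-pass) halves of x."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x) - 1, 2)]
    return low, high


def wpt_energy_features(signal, levels):
    """Return one mean-square energy per terminal node (2**levels values).

    Assumption: the node statistic is the mean of squared coefficients;
    the exact normalisation used in the paper is not recoverable here.
    """
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    nodes = [list(signal)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            # Unlike the plain wavelet transform, BOTH halves are split again.
            low, high = haar_split(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return [sum(w * w for w in node) / len(node) for node in nodes]


# Toy signal: two sinusoids sampled at 64 points (illustrative only).
u = [math.sin(2 * math.pi * 3 * t / 64) + 0.5 * math.sin(2 * math.pi * 20 * t / 64)
     for t in range(64)]
p = wpt_energy_features(u, levels=3)
print(len(p))  # 2**3 = 8 features; seven levels would give 2**7 = 128
```

Because the Haar filters are orthonormal, the node energies add up to the energy of the input signal, which is a convenient sanity check. A production pipeline would more likely rely on a wavelet library that provides Db7 filters.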

Generative Adversarial Network
A generative adversarial network (GAN) consists of two models: the generative model G(z) and the discriminative model D(x). The goal of the generative model is to produce synthetic samples that the discriminative model cannot distinguish from real samples. At the same time, the objective of the discriminative model is to accept real samples and reject synthetic ones with the highest possible accuracy. In equilibrium, the discriminative model cannot identify the source of the data, meaning that synthetic data are indistinguishable from real data. Figure 2 shows a block diagram of a GAN as used in this work.
For learning, a GAN implements an adversarial competition between the generator G(z) and the discriminator D(x). Initially, a real sample u(t) is processed by the WPT feature extraction technique described above, and (12) is set as the real input of D(x). A random signal z(n) with a given distribution is input into the generator, which in turn produces a synthetic feature vector G(z). The discriminator is trained with a target value of 1 when a real sample is presented at its input, and with 0 for a synthetic example. This process repeats until a Nash equilibrium is reached. In general terms, the GAN optimization problem can be given by:
$$\min_{G} \max_{D} V(G, D) \tag{13}$$
where max stands for the maximization of the objective with respect to the discriminator D, while min refers to its minimization with respect to the generator G; V(G, D) is the GAN objective function, which can be given by:
$$V(G, D) = E_{x \sim P_{data}}[\log D(x)] + E_{x \sim P_G}[\log(1 - D(x))] \tag{14}$$
where $E_{x \sim P_{data}}$ represents the expectation over the real probability distribution, whereas $E_{x \sim P_G}$ represents the expectation over the generated distribution. Since x and z are real-valued random variables on the probability space, their expected values can be defined as integrals. Therefore, (14) can be rewritten as
$$V(G, D) = \int_x P_{data}(x) \log D(x)\,dx + \int_x P_G(x) \log(1 - D(x))\,dx \tag{15}$$
where $\int_x P_{data}(x)\,dx$ stands for the expectation $E_{x \sim P_{data}}(\cdot)$ and $\int_x P_G(x)\,dx$ denotes the expectation $E_{x \sim P_G}(\cdot)$. The integrand of (15) is
$$f(D) = P_{data}(x) \log D(x) + P_G(x) \log(1 - D(x)) \tag{16}$$
where $P_{data}(x)$ is the distribution of the real data while $P_G(x)$ is the distribution of the generated data. When $\dot{f} \to 0$, i.e., when the derivative of f with respect to D vanishes, D reaches the maximizing value, which is given by
$$D^*(x) = \frac{P_{data}(x)}{P_{data}(x) + P_G(x)} \tag{17}$$
When $P_G(x) = 0$, $D^*$ becomes 1, meaning that the discriminator can effectively recognize synthetic data; when $P_G(x)$ is close to $P_{data}(x)$, $D^*$ tends to the optimal value of 0.5, which means that synthetic data are indistinguishable from real data. Plugging (17) into (15), one has:
$$V(G, D^*) = \int_x \left[ P_{data}(x) \log \frac{P_{data}(x)}{P_{data}(x) + P_G(x)} + P_G(x) \log \frac{P_G(x)}{P_{data}(x) + P_G(x)} \right] dx \tag{18}$$
As the objective of the generative part is to shrink the distance between real and generated data, the loss function of the generative model can be defined as [...] (19)
Under the above loss, the training process of a GAN is, in general, not stable and is prone to mode collapse. To mitigate this problem, the Wasserstein GAN was proposed, in which the KL divergence of the classic GAN is replaced by the 1-Wasserstein distance [39]. The Wasserstein GAN loss function is therefore given by: [...] (20)
This change in the loss function can make the convergence of the generator faster, but it can be further improved. In [40], an improved Wasserstein GAN was proposed in which an additional gradient penalty is added to (20), i.e., [...] (21) where α is a user-defined scaling factor and λ stands for the gradient penalty coefficient. In another approach, Cabrera et al. [41] proposed a metric aimed at keeping track of the best current generator as training progresses. The metric is given by: [...] (22) where $R_{mean}$ and $G_{mean}$ are the centroids of the real and generated data clusters, respectively, while $R_{std}$ and $G_{std}$ are the dispersions (standard deviations) of the real and generated data, respectively. The smaller (22) is, the closer the generated data are to the real data. In each training step, (22) is computed for the current generator, and the generator exhibiting the lowest distance so far is taken as the best current generator. Hereafter, we refer to this process as (model) generator selection.
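The behaviour of the optimal discriminator in (17) and of the objective in (18) can be checked numerically on a small discrete example. The sketch below is an illustration under stated assumptions, not part of the original method: the integrals are replaced by sums over a finite support, the probability vectors are made up, and the function names are hypothetical.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = P_data(x) / (P_data(x) + P_G(x)) at every support point.

    Assumes P_data(x) + P_G(x) > 0 at every point.
    """
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value_at_optimum(p_data, p_g):
    """Discrete analogue of (18): V(G, D*) as a sum over the support."""
    total = 0.0
    for pd, pg, d in zip(p_data, p_g, optimal_discriminator(p_data, p_g)):
        if pd > 0:
            total += pd * math.log(d)
        if pg > 0:
            total += pg * math.log(1.0 - d)
    return total


p_data = [0.1, 0.2, 0.3, 0.4]      # hypothetical real distribution
p_match = [0.1, 0.2, 0.3, 0.4]     # generator that matches the real data
p_off = [0.4, 0.3, 0.2, 0.1]       # generator that does not

print(optimal_discriminator(p_data, p_match))   # every entry is 0.5
print(value_at_optimum(p_data, p_match))        # equals -log(4)
print(value_at_optimum(p_data, p_off))          # strictly larger than -log(4)
```

When the generated distribution matches the real one, the discriminator output is 0.5 everywhere and the objective reaches its minimum of -log 4; any mismatch raises the value, which is what the generator tries to reduce.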

A Vapnik Loss Inspired GAN
The loss function is a key issue for model selection. Therefore, we adopt a recently proposed loss function within the GAN and, more concretely, within the Wasserstein GAN framework. This loss function, first proposed by Vapnik et al. [42], considers the geometric distance between the predicted and the original data. In brief, the classical loss function used in regression, given a set of N examples, can be given by: [...] (23) where $x_i$ and $y_i$ are the i-th independent and dependent observations, respectively, h(x) is the regressor hypothesis, and I is the identity matrix. Based on the VC theory, in [42] the identity matrix is replaced by the so-called V-matrix, i.e., (23) becomes: [...] (24) where V is the V-matrix. For data in $R^d$, the V-matrix can be computed for all i, j = 1, ..., d as [...] (25) where d denotes the number of data dimensions, $0 \le x_i \le c_i$, and $c_1, \ldots, c_d$ are non-negative constants. This approach has shown good results in regression problems. Motivated by both the theoretical background and the experimental results obtained in regression problems, including in the framework of SVR [43], we propose to apply this loss function in the framework of GAN as a regularization term in (22), which becomes: [...] (26) where $L_v$ is of the form (24).
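The step from the identity matrix to the V-matrix can be illustrated with a generic quadratic form. The sketch below assumes, based only on the description above, that the classical loss is the residual vector multiplied by the identity matrix on both sides and that the V-matrix simply takes the place of the identity. The matrix `V_example` is a made-up symmetric matrix used for illustration; it is not computed with the V-matrix formula of [42], and `quadratic_loss` is a hypothetical name.

```python
def quadratic_loss(y, h, V=None):
    """Return r^T V r with r = y - h; V=None stands for the identity matrix."""
    r = [yi - hi for yi, hi in zip(y, h)]
    n = len(r)
    if V is None:
        # Identity case: plain sum of squared residuals.
        return sum(ri * ri for ri in r)
    return sum(r[i] * V[i][j] * r[j] for i in range(n) for j in range(n))


y = [1.0, 2.0, 3.0]          # observed targets (illustrative)
h = [1.5, 2.0, 2.0]          # model predictions (illustrative)

# Hypothetical symmetric weighting matrix standing in for the V-matrix.
V_example = [[2.0, 1.0, 0.0],
             [1.0, 2.0, 1.0],
             [0.0, 1.0, 2.0]]

print(quadratic_loss(y, h))              # identity weighting
print(quadratic_loss(y, h, V_example))   # V-weighted residuals
```

With the identity matrix every residual counts equally and independently; a non-diagonal matrix additionally weights pairs of residuals, which is how the mutual position of the observations can enter the loss.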

Random Forests for Fault Classification
Ensemble learning uses a group of algorithms to get a better prediction than any of its base algorithms. A random forest (RF) is a homogeneous ensemble classifier that uses a set of decision trees (DT) [44,45]. Each DT is grown independently using the Bagging technique. In addition, and to increase diversity (reducing the correlation between trees), an RF grows each tree from a random selection of data features. Once trained, an RF uses a majority voting mechanism for making its classification (or regression) prediction.
The CART algorithm is frequently used to grow a decision tree. In the CART algorithm, the Gini index is the metric used for selecting the data set feature to be used in a given node of the tree. Given a node m and the estimated class probabilities $p(c|m)$, $c = 1, 2, 3, \ldots, C$, the Gini impurity index is defined as:
$$G(m) = 1 - \sum_{c=1}^{C} p(c|m)^2$$
Let n be the splitting point of node m that separates the node into two child nodes, in which a proportion $P_a$ of the samples in m is assigned to $m_a$ and a proportion $P_b$ is assigned to $m_b$, i.e., $P_a + P_b = 1$. Thus, the decrease in the Gini impurity index is defined as follows:
$$\delta G(m, n) = G(m) - P_a G(m_a) - P_b G(m_b)$$
The optimal feature $j^*$ and the optimal splitting point $n^*$ are those that produce the largest decrease in the Gini impurity, i.e.,
$$(n^*, j^*) = \arg\max_{n, j} \delta G(m, n)$$
The flowchart for building an RF is shown in Figure 3.
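The split selection described above can be sketched in a few lines. This is a minimal illustration of the standard Gini-based search over features and thresholds, not the random forest implementation used in the experiments; the toy rows, the labels, and the function names are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease when splitting on values <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    p_a, p_b = len(left) / len(labels), len(right) / len(labels)
    return gini(labels) - p_a * gini(left) - p_b * gini(right)


def best_split(rows, labels):
    """Search every feature j and threshold n; return (decrease, j, n)."""
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        # The largest value is skipped: it would leave the right child empty.
        for threshold in sorted(set(column))[:-1]:
            dec = gini_decrease(column, labels, threshold)
            if dec > best[0]:
                best = (dec, j, threshold)
    return best


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [[0.1, 5.0], [0.2, 1.0], [0.8, 4.0], [0.9, 2.0]]
labels = ["healthy", "healthy", "fault", "fault"]
print(best_split(rows, labels))
```

In a random forest, the same search is run at every node, but only over a random subset of the features, which is what decorrelates the individual trees.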

Data Generation for Fault Classification
Based on the feature extraction process that uses the wavelet packet transform, illustrated in Section 2.1, the GAN described in Section 2.2, and the random forest classifier detailed in the previous section, one can set up the learning scheme for the manipulator fault diagnoser. As shown in Figure 4, a real data observation from each class is sent to the WPT to extract the vector of features (12). Meanwhile, a random signal z is input into the generator, which produces a vector of synthetic features G(z). The goal of the discriminator D(x) is to distinguish between the real vector of features, outputting a 1, and the synthetic vector of features, outputting a 0. The learning process described in Section 2.2 is applied. Once both the generator and the discriminator are trained, the generator can be used to generate as many synthetic data as required. Notice that the learning process and the subsequent synthetic generation are carried out for each faulty class, i.e., for each class in which we need to increase the number of observations. Finally, the real (healthy and faulty) data together with the generated faulty data are used in the random forest classifier for fault classification. Based on the above description, the procedure can be outlined in the flow chart of Figure 5. In addition, the complete data pipeline for fault diagnosis of the manipulator is illustrated in Figure 6.
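The assembly of the balanced training set can be sketched as follows. This is a simplified illustration under stated assumptions: the per-class generators are hypothetical placeholders that merely jitter real rows (they stand in for trained GAN generators and are not GANs), the class names follow the fault types named in the paper, and the sample counts and feature length are kept tiny for readability.

```python
import random


def make_placeholder_generator(rows, rng, noise=0.01):
    """Hypothetical stand-in for a trained G(z): jitters a random real row."""
    def generate():
        base = rng.choice(rows)
        return [v + rng.gauss(0.0, noise) for v in base]
    return generate


def balance_with_generators(real_by_class, generators, target):
    """Top up every class that has a generator to `target` rows."""
    features, labels = [], []
    for label, rows in real_by_class.items():
        rows = list(rows)
        if label in generators:
            missing = max(0, target - len(rows))
            rows += [generators[label]() for _ in range(missing)]
        features.extend(rows)
        labels.extend([label] * len(rows))
    return features, labels


rng = random.Random(0)
real = {
    "healthy": [[rng.random() for _ in range(4)] for _ in range(10)],
    "pitting": [[rng.random() for _ in range(4)] for _ in range(2)],
    "broken_tooth": [[rng.random() for _ in range(4)] for _ in range(2)],
    "cracking": [[rng.random() for _ in range(4)] for _ in range(2)],
}
# One generator per faulty class, mirroring the one-GAN-per-fault scheme.
generators = {label: make_placeholder_generator(rows, rng)
              for label, rows in real.items() if label != "healthy"}
X, y = balance_with_generators(real, generators, target=10)
print({label: y.count(label) for label in real})
```

The resulting feature matrix and label list would then be passed to the random forest classifier; the healthy class is left untouched because it already has enough real observations.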

Experimental Test-Rig
Experiments were carried out on the gear shaft system of a Brtirus1510A industrial manipulator with six degrees of freedom. The gear system, which is the main driver component of the manipulator, consists of two planetary gears and two sun gears. The objective is to monitor the health state of the gears by measuring the vibration signals with a PCB 622B01 accelerometer. The accelerometer was installed at the base of the sixth axis of the manipulator. See Figure 7 for its exact location. Cracking, pitting, and broken tooth are the main gear fault types in manipulators. Table 1 shows the fault types we are interested in. Figure 8 shows an example of each one of the four types of faults.
The robot is moved by the motors, and the teaching box gives the robot the instruction to start its next movement. At the beginning of the process, the robot is in its original position of 0 degrees. Firstly, it performs a back-and-forth movement between the limit points of the first axis, from −115 degrees to 140 degrees. Secondly, the same movement is performed over a limit range from −50 degrees to 35 degrees. Thirdly, the robot moves from −60 degrees to 90 degrees. Fourthly, the same type of movement is performed from −180 degrees to 180 degrees. Fifthly, the movement range is reduced to −90 degrees to 90 degrees. Finally, the robot moves from −180 degrees to 180 degrees and stops in the original position. This series of dynamic movements constitutes a single experiment. For the next experiment, the part is replaced by a faulty part from Table 1 and the above movement sequence is repeated. The signal in each channel is collected by an NI acquisition system, an analog-to-digital conversion system whose digital samples are collected through an interface on a laptop.

Collected Vibration Signals
As mentioned above, the experimental measurements include the four fault types shown in Table 1. The sampling rate is 100 kHz. The duration of each measurement is 20 s. The sampling interval was set to 0.2 s. Thus, 20,000 observations were obtained for each fault type, each with 20 k points (100 kHz × 0.2 s). Therefore, a data set of 80,000 × 20 k was acquired during the experiment. Figure 9 shows an example of a vibration signal acquired for each one of the fault types. For hold-out validation, the data set was divided into two disjoint subsets, the training and the test (sub)sets. The training set has 70% of the data while the test set has the remaining 30%. Table 2 shows the number of observations in each one of these sets.
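The sizes quoted above follow from simple arithmetic, which the short sketch below reproduces. It only restates numbers given in the text; the variable names are illustrative, and the way Table 2 reduces the faulty classes to produce the unbalanced training set is not reproduced here.

```python
sampling_rate_hz = 100_000       # 100 kHz
interval_s = 0.2                 # length of one observation in seconds
observations_per_type = 20_000   # as reported in the text
n_types = 4                      # condition types listed in Table 1

points_per_observation = round(sampling_rate_hz * interval_s)
total_observations = observations_per_type * n_types
train_per_type = round(0.70 * observations_per_type)
test_per_type = observations_per_type - train_per_type

print(points_per_observation)  # 20000 points per observation
print(total_observations)      # 80000 observations in total
print(train_per_type)          # 14000 observations in a 70% training split
print(test_per_type)           # 6000 observations in the 30% test split
```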

On Different Scenarios for Dealing with the Unbalanced Data Set
Several scenarios were considered to assess the effectiveness of the proposed model. In all of them, an RF with 1000 trees was used for classification. The scenarios are identified as follows: RF-i denotes a random forest trained with the unbalanced data set described in Table 2; RF-b2 denotes a random forest trained with a subsampled balanced data set with only 140 observations per condition; RF-GAN and RF-GAN2 denote random forests trained with data sets that have the 14,000 real healthy observations and 14,000 faulty observations. The only difference between these two scenarios is that RF-GAN uses the technique described in Section 2.2 to select the best current model for generating samples, while RF-GAN2 uses the model obtained in the last iteration (i.e., iteration 20,000) of the training process to generate the samples. RF-GAN1 is similar to RF-GAN, with the difference that only 13,860 synthetic faulty states were generated, while the remaining 140 are the original faulty states present in the training data set. For comparison purposes, we also considered a random forest trained with a data set previously processed by SMOTE, a popular oversampling method for dealing with unbalanced data sets.
In all GAN-based models, the generators are multi-layer perceptrons with a 64:1014:128 fully connected topology, while the discriminators are also multi-layer perceptrons, with a 128:1024:2048:1 fully connected topology. These were selected empirically after some preliminary tests. The Adam optimizer is used with its key parameters set to $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$. The learning rate is set to $1 \times 10^{-5}$ for the generators and to $1 \times 10^{-4}$ for the discriminators. The parameters α and λ are set to $1 \times 10^{-4}$ and 1.0, respectively. A maximum number of 10,000 iterations was set for training. Figure 10 shows the distribution of the obtained accuracy for each scenario using boxplots. The results presented in Figure 10 were analyzed with the Friedman test, a non-parametric statistical hypothesis test used to evaluate whether or not there is a statistically significant difference between the results (boxplots) of the different scenarios. The Friedman null hypothesis is that there is no statistically significant difference between the results of the different scenarios. Given a significance level α, this hypothesis cannot be rejected whenever $p_{Friedman}$, the p-value generated by the test, satisfies $p_{Friedman} > \alpha$. The null hypothesis is rejected otherwise, meaning that there is a statistically significant difference between the analyzed scenarios. In such a case, we can detect which of the scenarios is responsible for the difference by resorting to a pairwise post hoc test. A ranking can be obtained by counting the number of times that a method was the winner in the pairwise comparisons. See [46,47] for further details. Here, we use the usual α = 0.05 and the Wilcoxon test as the post hoc test.
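The decision rule of the Friedman test can be sketched with the standard library alone. The code below is an illustrative implementation of the textbook Friedman statistic (without tie correction) together with a chi-square tail probability for integer degrees of freedom; the accuracy values are hypothetical, the function names are assumptions, and the Wilcoxon post hoc step is not included.

```python
import math


def average_ranks(values):
    """Rank values 1..k in ascending order; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed form for df = 1 (odd) or df = 2 (even), then step df up by 2.
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    while 2 * a < df:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def friedman_test(results):
    """results[i][j] = accuracy of scenario j in run i. Returns (Q, p-value)."""
    n, k = len(results), len(results[0])
    rank_sums = [0.0] * k
    for row in results:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return q, chi2_sf(q, k - 1)


# Hypothetical accuracies: 10 runs (rows) of 3 scenarios (columns).
results = [[0.80 + 0.001 * i, 0.90 + 0.001 * i, 0.95 + 0.001 * i]
           for i in range(10)]
q_stat, p_value = friedman_test(results)
alpha = 0.05
print(q_stat, p_value, "reject H0" if p_value < alpha else "cannot reject H0")
```

In practice, a library routine such as SciPy's `scipy.stats.friedmanchisquare`, followed by `scipy.stats.wilcoxon` for the pairwise comparisons, would normally be preferred over a hand-written version.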
Applying the Friedman test to the results in Figure 10 yielded $p_{Friedman} = 1.865 \times 10^{-19} < 0.05$, meaning that there is a statistically significant difference between the six scenarios. Table 3 shows the subsequent post hoc results. From these, one can conclude that there is no statistically significant difference between scenarios RF-GAN and RF-GAN1 and that these outperform all the others. This is an interesting observation, as both RF-GAN and RF-GAN1 use the technique described in Section 2.2 to select the best current generator, and the only difference between these scenarios is that in RF-GAN all the faulty state data are synthetic, while in RF-GAN1 only 13,860 faulty examples are synthetic and the remaining 140 are the original faulty states present in the training data set. This further endorses the quality of the obtained generators.

The average accuracy of RF-i was 87.88%, while, with an undersampled balanced data set, RF-b2 reached 94.5%. RF-GAN had an average accuracy of 97.06%, RF-GAN1 97.75%, and RF-GAN2 95.38%. SMOTE had an average accuracy of 95.17%. We observe that the GAN-based average accuracies are all higher than those of the unbalanced (RF-i), undersampling (RF-b2), and oversampling (SMOTE) scenarios. Within the GAN-based scenarios, RF-GAN showed a difference of 1.68 percentage points relative to RF-GAN2 in terms of average accuracy.
Curiously enough, we observed no advantage in the application of (24). For the moment, we keep in mind that (26) is not the only way to use the Vapnik loss in a GAN and that other forms are currently being studied.

On the Performance in Each Class
To further analyze the above results, the recall of each class is studied. As shown in Figure 11, the recall of the RF-i model for the healthy state reaches 100%, while, for the three faulty classes, it drops to 63.88% for gear pitting, 84.3% for gear broken tooth, and 63.35% for gear cracking, for an average of 77.88%. This clearly shows the effect of the unbalanced data set. For RF-GAN, the recall in each class is 99.53%, 99.73%, 99.63%, and 99.87%, respectively. For RF-GAN1, the recall in each class is 98.63%, 98.67%, 98.28%, and 95.40%, respectively. For the RF-GAN2 model, the recall in each class is 99.30%, 93.08%, 98.15%, and 90.08%, respectively. The high recall values are due to the existence of sufficient examples in each class. This is also visible in the recall of SMOTE. The F1-score is another metric that can be used to compare the relative performance of the different scenarios in each class. From Figure 12, one can see that the performance of RF-i as measured by the F1-score is 70. For completeness, the confusion matrices are presented in Figure 13. All these matrices consider the 6000 test observations as per Table 2. For RF-i (Figure 13a), one can see a high number of misclassifications in the non-healthy states due to the unbalanced training set. These misclassifications are strongly reduced (especially in the GAN-based scenarios) when enough data are generated and used for training.
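For reference, recall and F1-score can be read directly off a confusion matrix. The sketch below uses the standard definitions; the two-class confusion matrix is hypothetical, and the function name is an assumption. The last lines recompute the RF-i average recall from the four per-class values quoted above.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as j."""
    k = len(confusion)
    metrics = []
    for c in range(k):
        tp = confusion[c][c]
        actual = sum(confusion[c])                        # row total
        predicted = sum(confusion[r][c] for r in range(k))  # column total
        recall = tp / actual if actual else 0.0
        precision = tp / predicted if predicted else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"recall": recall, "precision": precision, "f1": f1})
    return metrics


# Hypothetical two-class example: 10 faulty samples are missed.
confusion = [[50, 0],
             [10, 40]]
for m in per_class_metrics(confusion):
    print(m)

# Average of the RF-i recalls reported in the text (in percent).
rf_i_recalls = [100.0, 63.88, 84.3, 63.35]
rf_i_average = sum(rf_i_recalls) / len(rf_i_recalls)
print(round(rf_i_average, 2))  # 77.88
```

A class can have perfect recall and still a lower F1-score when other classes are misclassified into it, which is why both indicators are reported.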

Learning Curves
The performance of the proposed model under different data set sizes is now considered. Figure 14 shows the average performance over 20 independent runs on the test set for the scenario RF-GAN when only a given percentage of faulty observations is available for training. More concretely, the following percentages were considered: i = 1, 2, 4, 6, 8, 10, 20, 60, 80, and 100%. For instance, when i = 4%, the number of training examples in each faulty state is 14,000 × 0.04 = 560. It can be seen that the average accuracy increased from 56.25% to 97.05% when the availability of faulty data increased from 1% (140 examples) to 100% (14,000 examples) in each fault type. There was a strong increase in performance up to 20%; after that point, the improvement in accuracy became slower and slower until about 80%. Beyond this value, the improvement was negligible. That is, adding more data after a certain point hardly improves the performance.
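The subset sizes behind the learning curve can be tabulated as follows. This is a small sketch that only restates the percentages listed above; the `subsample` helper and the use of a seeded random generator are assumptions about how such subsets might be drawn, not a description of the original experiment.

```python
import random

full_size = 14_000
percentages = [1, 2, 4, 6, 8, 10, 20, 60, 80, 100]

# Number of faulty training examples per fault type at each percentage.
subset_sizes = {p: full_size * p // 100 for p in percentages}
print(subset_sizes)


def subsample(rows, pct, rng):
    """Draw pct percent of rows without replacement (hypothetical helper)."""
    return rng.sample(rows, len(rows) * pct // 100)


rng = random.Random(0)
toy_rows = list(range(200))          # stand-in for 200 feature vectors
print(len(subsample(toy_rows, 4, rng)))
```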

Shuffling Data
When generating training examples from a GAN-based model, shuffling the training data is a key factor for obtaining an acceptable performance. Figure 15 illustrates the importance of data shuffling. The results presented in this figure were obtained with exactly the same RF-GAN configuration, the only difference being the way the data are presented to the GAN for training. For non-shuffled data, the model is simply unable to generate suitable samples, regardless of the initial conditions (weights).
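One common way to shuffle the data presented to a model is to permute the sample indices before cutting them into mini-batches. The sketch below illustrates that idea only; the paper does not state the batch size or whether shuffling is repeated at every epoch, so both are assumptions here, as is the function name.

```python
import random


def minibatches(n_samples, batch_size, rng, shuffle=True):
    """Yield lists of sample indices, optionally in random order."""
    indices = list(range(n_samples))
    if shuffle:
        rng.shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


rng = random.Random(42)
ordered = list(minibatches(10, 4, rng, shuffle=False))
shuffled = list(minibatches(10, 4, rng, shuffle=True))
print(ordered)    # consecutive indices, same order every epoch
print(shuffled)   # same indices, random order
```

Calling the generator again with the same `rng` object produces a different permutation, so each pass over the data sees the samples in a new order.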

On the Distribution Used for Sampling Random Inputs
In a GAN-based model, the z signal presented to the generator (recall Figure 4) can be drawn from any distribution. However, for this particular application, some distributions are better than others for the training process. Figure 16 shows the classification results for RF-GAN when z is sampled from (a) the standard normal distribution (mean 0 and variance 1) and (b) a uniform distribution in [−1, 1]. The former clearly outperforms the latter.
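The two sampling schemes compared above can be written down in a few lines. This is a minimal sketch: the latent dimension of 64 is taken from the input size of the generator topology reported earlier, while the function names and the seeded random generator are assumptions.

```python
import random


def sample_z_normal(dim, rng):
    """Latent vector with independent standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def sample_z_uniform(dim, rng):
    """Latent vector with independent entries uniform in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
z_normal = sample_z_normal(64, rng)
z_uniform = sample_z_uniform(64, rng)
print(len(z_normal), len(z_uniform))
print(min(z_uniform) >= -1.0 and max(z_uniform) <= 1.0)  # True
```

The uniform samples are bounded, whereas the normal samples occasionally fall well outside [−1, 1]; this difference in the support of z is one plausible reason for the different training behaviour, although the paper does not analyse the cause.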

On the Initial Conditions
A GAN is trained using a gradient-based method that is sensitive to initial conditions (weights). Figure 17 illustrates the impact of the initial conditions on the fault classification results. As shown in that figure, RF-GAN was able to produce an acceptable accuracy for any of the initial sets of weights used. However, as expected for local optimization methods, in 30 independent runs (initializations) it was possible to identify a particular set of initial random weights that outperformed all the others.

Conclusions
Robotic manipulators are widely used in industry, and their maintenance and monitoring systems are resorting more and more to data-intensive machine learning methods. Methods such as multilayer perceptrons, convolutional neural networks, echo state networks, or deep Boltzmann machines have all been used for such endeavors. However, all of these methods rely on a representative, balanced, and large enough training set, which, due to the very nature of (some) faults, is very hard to collect from the equipment. Motivated by the recent success of generative adversarial networks (GAN), in this work we have exploited, for the first time, this type of generative model as an oversampling method for fault classification in an industrial robotic manipulator.
A comprehensive empirical analysis was performed taking into account six different scenarios for mitigating the unbalanced data, including classical undersampling and oversampling (SMOTE) methods. In the GAN-based scenarios, a wavelet packet transform combined with a GAN is used for feature generation, while a random forest is used for fault classification. Studies were also conducted to assess the sensitivity to aspects such as generator selection, the number of training examples in each class, training data shuffling, the distribution used for sampling the random input data, and initial conditions.
The main conclusion is that it was possible to increase the performance of the fault diagnoser for an industrial robotic manipulator with any of the GAN-based models over classical undersampling and oversampling (SMOTE) methods. This is accomplished at the expense of a much higher design and computational effort. Training a GAN is not an easy task due to mode collapse and other factors, and it is certainly a quite time-consuming process. However, after training a GAN for each fault, one has a set of generators able to produce as much synthetic data as required in an efficient way.
Within the GAN based models, those that keep track of the best current generator during training yielded the best results. No statistically significant difference was observed between the scenario that uses exclusively synthetic data for the faulty states and the scenario that uses the available real data for such states. This is yet another piece of evidence on the quality of the obtained GAN generators.
In many cases, such as prognostics and health management (PHM), sufficient data can effectively improve the fault monitoring capability of an industrial system. However, this cannot be fully achieved when faulty data are lacking. A GAN is an efficient tool for overcoming the limitation of data imbalance, which can enhance the monitoring capability in PHM. Therefore, this approach can provide a data foundation for PHM. However, the approach still has limitations. One is that the model can only learn the data distribution from a limited source of faulty data, while there are many kinds of faulty data in an industrial system. New faulty data need to be sent to the model so that it learns the new distribution and generates enough data, which adds training time. Another is that the approach is trained by sending one single faulty class as input to the GAN in order to obtain a generator. This means that several GAN models must be trained for several faulty classes, which is computationally demanding and time-consuming.