Fault Feature Extraction Method of a Permanent Magnet Synchronous Motor Based on VAE-WGAN

This paper addresses the difficulty that arises when the number of fault samples collected from a permanent magnet synchronous motor is too low and seriously imbalanced compared with the normal data. In order to effectively extract the fault characteristics of the motor and provide a basis for subsequent research on fault mechanisms and diagnosis methods, a permanent magnet synchronous motor fault feature extraction method based on a variational auto-encoder (VAE) and an improved generative adversarial network (GAN) is proposed. The VAE is used to extract fault features and is combined with the GAN to extend the data samples; two-dimensional features are extracted by means of the mean and variance for visual analysis to measure the classification effect of the model on the features. Experimental results show that the method has good classification and generation capabilities to effectively extract the fault features of the motor and its accuracy is as high


Introduction
Permanent magnet synchronous motors (PMSMs), which offer high efficiency, small size, high power density, and a wide speed range, are widely used in production and daily life. During long-term operation of the motor, faults may occur, such as electrical faults, single demagnetization faults, mechanical faults, and coupling faults in which multiple faults affect each other [1]. These faults make the motor easily damaged during use [2], resulting in economic losses and even casualties. Data-driven diagnosis has advantages over maintenance with traditional diagnostic equipment in terms of economy, personal safety, and diagnostic accuracy [3,4]; it only needs to extract the data features of faults and apply a series of analysis methods. With the development of big data technology, deep learning has been widely used in the field of fault diagnosis because of its good feature extraction capabilities, but a sufficient and balanced data set is a prerequisite for ensuring the performance of fault diagnosis methods based on deep learning [5]. In practical engineering applications, however, motor faults occur only occasionally, so the data collected by on-line monitoring equipment are non-stationary, nonlinear, multi-source, heterogeneous, and of low value density, and the amount of effective fault data is small. In addition, building a motor fault simulation test-bed to collect data samples has certain limitations. First, it is difficult to simulate all fault types and fault degrees, resulting in incomplete samples; second, the motor may be permanently damaged, resulting in high test costs, and the tests may even affect the safety of laboratory personnel and sites. Because the number of collected fault samples is too low and seriously imbalanced compared with the normal data, it is necessary to find a reliable sample data expansion method to avoid under-fitting or over-fitting.

Variational Auto-Encoder Model
VAE is a deep generative model and a powerful learning method applied to fields such as images and language. It is an extension of the auto-encoder and has the same structure as the auto-encoder, consisting of an encoder, a latent space, and a decoder. In addition to these structures, VAE also adds a sampling layer that allows the model to generate new data from the latent space. The structure of the VAE model is shown in Figure 1.
The variational auto-encoder network constructs the encoder according to an effective loss criterion, so that the extracted features retain a topology similar to that of the input data. The encoded output features serve as the synthesis input of the decoder; the differences between the output and input data are measured by an appropriate objective function, and the distance between them is reduced by adjusting the weight and bias parameters. From the probability distribution of the real fault set, the network obtains the latent representation of the input samples, including mean and variance, by learning the distribution through the KL divergence and the coding calculation, so that the generated samples are similar to the real fault samples. The variational auto-encoder network model is shown in Figure 2. VAE can learn the distribution of data through the encoding process, which is equivalent to mastering the noise distribution corresponding to various data. Therefore, using VAE for data preprocessing can generate the required data through the specific noise obtained by learning.
Since it directly calculates the mean square error between the generated data and the original data, without adversarial learning, the use of VAE may make the generated data unrealistic.
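As a minimal sketch of the quantities discussed in this section, the following NumPy code shows a sampling layer, the KL divergence term, and a mean-square-error reconstruction term. It assumes a Gaussian encoder that outputs the mean and the log-variance of each latent variable and a standard normal prior; these choices, the function names, and the array shapes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sampling layer: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions, one value per sample."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)

def vae_loss(x, x_hat, mu, log_var):
    """Squared reconstruction error plus KL term, averaged over the batch."""
    recon = np.sum((x - x_hat) ** 2, axis=1)
    return float(np.mean(recon + kl_divergence(mu, log_var)))

# Toy usage with assumed shapes: 4 samples, 4 input features, 2 latent dimensions.
rng = np.random.default_rng(0)
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))
z = reparameterize(mu, log_var, rng)
x = rng.standard_normal((4, 4))
print(z.shape, vae_loss(x, x, mu, log_var))
```

Because the reconstruction term compares samples element by element, it rewards averaged outputs, which is consistent with the remark above that VAE-generated data may look unrealistic.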

Improved Generative Adversarial Network
GAN is an unsupervised method for learning probability distributions. A conventional GAN consists of two parts, the generator (G) and the discriminator (D), which are trained against each other as a game. GAN mainly adopts the idea of mutual confrontation to improve the quality of the generated data. Random noise z is fed to the generator, which is trained to generate fake samples that the discriminator will not recognize as generated. The discriminator takes the real training data and the pseudo samples generated by the generator as input and is trained to distinguish the generated samples from the real data. The training model of GAN is shown in Figure 3.

In general, the JS divergence is used to measure the distance between probability distributions in the classic GAN model, but when the real distribution and the generated distribution do not intersect, the JS divergence cannot provide stable gradient information for back-propagation, and the model is difficult to train. An effective method for solving the instability in training the generative adversarial network is Wasserstein-GAN (WGAN), in which training is performed by replacing the JS divergence with the Wasserstein distance.
The Wasserstein distance measures the similarity between two probability distributions and alleviates the problem of vanishing gradients during GAN training. The loss function value of the WGAN model provides a quantitative standard for the quality of the generated data: a smaller loss value means that the generated data is more realistic. In addition, when training WGAN, instead of directly balancing the training of the generator network and the discriminator network, the discriminator network is first optimized until convergence, and then the generator network is updated, which makes the whole network converge faster.
WGAN is a generative adversarial network that optimizes training by using the Wasserstein distance instead of the JS divergence. For the real distribution $p_\gamma$ and the model distribution $p_\theta$, their Wasserstein distance [16] is:

$$W(p_\gamma, p_\theta) = \inf_{\pi \in \Pi(p_\gamma, p_\theta)} \mathbb{E}_{(x, y) \sim \pi}\left[\lVert x - y \rVert\right]$$

where $\Pi(p_\gamma, p_\theta)$ is the set of all possible joint distributions with marginal distributions $p_\gamma$ and $p_\theta$. When the two distributions do not overlap or overlap very little, the JS divergence is log 2 and does not change with the distance between the two distributions. The Wasserstein distance can still measure the distance between two non-overlapping distributions.
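The contrast between the two measures can be checked numerically. The sketch below is an illustration under simple assumptions (two point masses on a one-dimensional grid with unit spacing; all function names are hypothetical): the JS divergence stays at log 2 however far apart the masses are, while the one-dimensional Wasserstein distance, computed from the difference of the cumulative distributions, grows with the separation.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, dx=1.0):
    """1-D Wasserstein-1 distance on a uniform grid via the CDF difference."""
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * dx

def point_mass(n, idx):
    """Distribution on n grid points with all mass at position idx."""
    v = np.zeros(n)
    v[idx] = 1.0
    return v

n = 50
p = point_mass(n, 0)
for shift in (5, 20, 40):
    q = point_mass(n, shift)
    print(shift, js_divergence(p, q), wasserstein_1d(p, q))
```

For every shift the first printed value equals log 2, so it carries no information about how far the generated distribution is from the real one, whereas the second value equals the shift itself.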
The objective function of WGAN [16] is:

$$\min_{\theta} \max_{\phi} \; \mathbb{E}_{x \sim p_\gamma}\left[f(x, \phi)\right] - \mathbb{E}_{z \sim p(z)}\left[f(G(z, \theta), \phi)\right]$$

where $f(x, \phi)$ is the evaluation network, constrained to be 1-Lipschitz, and $G(z, \theta)$ is the generator network driven by the noise $z$. Because $f(x, \phi)$ is a non-saturating function, the gradient with respect to the generator parameters $\theta$ will not vanish, which theoretically solves the problem of unstable training of the original GAN. Additionally, the objective function of the generator network in WGAN is no longer the ratio of the two distributions, which alleviates the problem of mode collapse to a certain extent and makes the generated samples more diverse. Compared with the original GAN, the last layer of the WGAN evaluation network does not use the sigmoid function, and the loss function does not take the logarithm.
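The two differences from the original GAN noted above (no sigmoid on the last layer, no logarithm in the loss) can be made concrete with a small sketch. It assumes a linear evaluation network and the weight clipping used in the original WGAN formulation to keep the network approximately Lipschitz; the clipping constant, the function names, and the toy data are assumptions, not values reported in this paper.

```python
import numpy as np

def critic_scores(x, w, b):
    """Linear evaluation network f(x; phi): raw scores, no sigmoid on the output."""
    return x @ w + b

def critic_loss(real_scores, fake_scores):
    """Loss minimised by the critic: E[f(fake)] - E[f(real)], with no logarithm.
    For a well-trained critic, the negative of this value approximates the
    Wasserstein distance up to a constant factor."""
    return float(np.mean(fake_scores) - np.mean(real_scores))

def generator_loss(fake_scores):
    """Loss minimised by the generator: -E[f(fake)]."""
    return float(-np.mean(fake_scores))

def clip_weights(w, c=0.01):
    """Weight clipping (assumed constant c) to bound the critic's slope."""
    return np.clip(w, -c, c)

# Toy usage: real and generated batches with 4 features each.
rng = np.random.default_rng(0)
real = rng.normal(2.0, 1.0, size=(256, 4))
fake = rng.normal(0.0, 1.0, size=(256, 4))
w = clip_weights(np.ones(4))
b = 0.0
print(critic_loss(critic_scores(real, w, b), critic_scores(fake, w, b)))
```

In a training loop following the schedule described earlier, the critic would be updated several times with `critic_loss` before each generator update with `generator_loss`.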

Design of Fault Feature Extraction Scheme Using Data Expansion Method
The classification model and the expansion model are combined to form a fault diagnosis model with expanded classification effects. Under different numbers of iterations, the mean and variance of the four data sets are compared and plotted in the same coordinate system. By observing and analyzing the convergence effect of the same data set and the classification effect of different data sets, the effectiveness of the WGAN expansion method and the classification effect of VAE are verified. Figure 4 shows the feature classification scheme for the turn-to-turn short-circuit fault of PMSM based on the data expansion method. The fault diagnosis steps for the turn-to-turn short circuit of the permanent magnet synchronous motor are as follows:
1. Preprocess the real data to avoid missing data and data imbalance caused by human factors.
2. Set the coding model, set the number of network layers and the function of each layer.
3. Set the generation/decoding model, set the number of network layers, and determine the convolution kernel of the convolutional neural network.
4. Set the discriminant model, set the number of network layers, and select the optimal discriminant function as the classifier.
5. Input motor fault data and extract the characteristics of the real motor data: mean and variance.
6. Restore the original data from the mean and variance to ensure that the output data has the same dimensions as the original data.
7. Use the two feature values, mean and variance, to measure the classification effect of the improved model, conduct a two-dimensional visualization analysis of the mean, quantify the classification effect, and use accuracy for comparison.
8. The generated data and the original data of the improved model after training constitute the final data set of the turn-to-turn short-circuit fault of PMSM.
9. The expanded data set is analyzed again after dimensionality reduction, and the data distribution at this time is compared with the real sample data distribution to verify the effectiveness of the expansion.

VAE-WGAN Model Structure
For motor fault data, similarity is by default measured element-wise, for example by the squared error. Element-wise metrics are simple, but they are not well suited to motor data because they cannot model the properties of human visual perception. Therefore, we advocate the use of higher-level and sufficiently invariant representations to measure data similarity. Such a similarity measure can be obtained through joint training of VAE and GAN. By making the VAE decoder and the GAN generator share parameters and train together, they are combined into one model. More importantly, we use the Wasserstein distance instead of the JS divergence to solve the problem of vanishing gradients. In the VAE training objective, we replace the typical element-wise reconstruction metric with a feature-wise metric expressed in the discriminator.
As shown in Figure 5, the overall framework of the VAE-WGAN model consists of three parts: the coding network, the decoding/generating network, and the discriminating network. The specific function of the decoding/generating network is to restore hidden variables to noise-free data. In the training process, the discriminating network is fixed first, and the decoding network is trained to make the decoded data distribution as close as possible to the original sample distribution. Ideally, the discriminating network finally judges the decoded data to be real data with a probability of 1.
VAE-WGAN combines the two and shares the decoder/generator to realize data generation. In the training of this structure, the discriminator provides the constraint of distinguishing true from false data, so that the reconstruction of the VAE can produce more realistic data. As shown in Figure 6, the z obtained in the VAE means that the input of the generator is no longer a completely random vector but is encoded from real data, which is equivalent to an additional piece of supervision information. (Originally, the generator cannot see the real data; it can only learn to generate data from the output of the discriminator, and that output is only a scalar, which makes it very difficult to guide the generation of high-dimensional vectors.) The final model combines the advantages of GAN as a high-quality generative model with the advantages of VAE as a method that provides an encoder of the data into the latent space z. Therefore, VAE-WGAN uses VAE as the generator of WGAN. Such a network not only has the controllable data generation of VAE but also has the excellent data generation performance of WGAN.
Figure 7 illustrates the detailed architecture of the encoder, decoder, and discriminator of the VAE-WGAN model proposed in this paper. The encoder, decoder, and discriminator share the same architecture. For the training experiment in this section, the model uses a convolutional structure with a convolution kernel size of 3, and post-convolution (transposed convolution) with a stride of 2 is used to amplify the data of the discriminator. The padding mode is SAME throughout to ensure that the data size dimensions are consistent, and the data is normalized before the activation function is applied. Post-convolution is realized by reversing the direction of convolution. The model is trained using RMSProp with a learning rate of 0.0003 and a step size of 64.
The encoder uses four convolutional layers to balance the number of extracted data features and the complexity of the network structure. The flattening layer converts the multi-dimensional data generated by the convolutional layers into one-dimensional data used by the fully connected layer. The fully connected layer outputs the mean and variance of the posterior distribution of the hidden variables. Except for the flattening layer, all layers of the coding network use the LeakyReLU activation function and perform batch normalization (BN). These measures make the output of each layer sparser, increase the non-linearity of the whole network, prevent gradient explosion or vanishing, and accelerate the convergence of the network.
As shown in Figure 6, the coding network of VAE-WGAN consists of six layers, and the first four layers have 16, 32, 64, and 128 convolution kernels, respectively. All convolutions use the LeakyReLU activation function, and the flattened data is then transferred to the latent space.
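The per-layer processing in the coding network, normalization followed by LeakyReLU, can be sketched as follows. This is only an illustration: the batch normalization here has no learned scale and shift, and the negative slope of 0.2 is an assumed value that the paper does not report.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalise each feature over the batch axis (no learned scale or shift)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, alpha=0.2):
    """LeakyReLU: identity for positive inputs, slope alpha (assumed) otherwise."""
    return np.where(x > 0, x, alpha * x)

def bn_leaky_block(x, alpha=0.2):
    """Normalise first, then activate, in the order described in the text."""
    return leaky_relu(batch_norm(x), alpha)

# Toy usage: a batch of 100 feature vectors with 16 channels.
rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=5.0, size=(100, 16))
normed = batch_norm(features)
activated = bn_leaky_block(features)
print(normed.mean(), normed.std(), activated.min())
```

Normalizing before the activation keeps the inputs of LeakyReLU centered, so both branches of the activation are used regardless of the scale of the raw features.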
The decoding network consists of six layers. The first layer is a fully connected layer with 2048 neurons; the second layer is a reshaping layer, which first reshapes the one-dimensional feature vector output by the fully connected layer to 2 × 2 × 512, then applies batch normalization to the feature map, and finally transforms the normalized results non-linearly with the ReLU activation function; the third to sixth layers are deconvolutional layers, with 256, 128, 64, and 32 convolution kernels, respectively. The activation function of the first five convolution layers is ReLU, and the last convolution layer uses the Tanh activation function.
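The layer sizes above can be checked with simple shape arithmetic. The sketch below assumes the usual SAME-padding convention, in which a strided convolution yields ceil(n / stride) outputs and a transposed convolution yields n × stride outputs, and it assumes that every deconvolutional layer uses the stride of 2 stated earlier. The resulting 32 × 32 feature map is a consequence of these assumptions and is not a size reported in the paper.

```python
import math

def conv_same_out(size, stride):
    """Spatial output size of a strided convolution with SAME padding."""
    return math.ceil(size / stride)

def deconv_same_out(size, stride):
    """Spatial output size of a transposed convolution with SAME padding."""
    return size * stride

def decoder_shapes(start=(2, 2, 512), filters=(256, 128, 64, 32), stride=2):
    """Feature-map shapes (height, width, channels) through the deconvolutional layers."""
    h, w, _ = start
    shapes = [start]
    for f in filters:
        h, w = deconv_same_out(h, stride), deconv_same_out(w, stride)
        shapes.append((h, w, f))
    return shapes

# The 2048 outputs of the fully connected layer match the reshaped tensor.
print(2 * 2 * 512)
for shape in decoder_shapes():
    print(shape)
```

Each transposed convolution doubles the spatial size while the number of kernels halves, which mirrors the encoder, where each stride-2 convolution would halve the spatial size under the same convention.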
The discriminating network has five layers in total. The first four layers are convolution layers, with 32, 64, 128, and 256 convolution kernels, respectively, and LeakyReLU activation functions; the last layer is a reshaping layer, which first reshapes the input multi-dimensional data into one-dimensional data, then linearly maps the reshaped results, and finally applies the Sigmoid function to the linear mapping results to output the probability that the network input is real data. The discriminating network is a binary classifier of true and false, its structure is relatively simple, and the output probability represents the authenticity of the data.

Motor Fault Parameter Acquisition
The performance parameters of the automotive PMSM used in this paper are as follows: rated power of 12 kW, rated speed of 1500 r/min, 10 motor poles, 45 stator slots, and water cooling. Since a rich and diverse data set can improve the learning ability of a deep neural network and avoid over-fitting, this paper combined the mutually unrelated feature terms A-phase current, B-phase current, electromagnetic torque, and negative sequence current into a four-dimensional fault sample set.
In this experiment, 5000 groups of real motor fault samples were divided into a training set and a test set in a ratio of 3:1. In total, 100 groups were sampled in each batch, with 2000 and 5000 iterations. As the number of iterations increased, the effectiveness of the model was analyzed from the following three aspects:
1. The change in the feature learning ability of the VAE network.
2. Time domain correlation analysis of generated data.
3. Validity analysis of samples generated after data expansion.
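The data partition described above can be sketched as follows. The signals are random placeholders standing in for the four measured quantities, and the shuffling before the split is an assumption; the paper states only the 3:1 ratio and the batch size of 100.

```python
import numpy as np

def split_train_test(data, ratio=(3, 1), rng=None):
    """Split rows into training and test sets according to ratio (shuffled if rng is given)."""
    idx = np.arange(len(data))
    if rng is not None:
        rng.shuffle(idx)
    n_train = len(data) * ratio[0] // sum(ratio)
    return data[idx[:n_train]], data[idx[n_train:]]

def iterate_batches(data, batch_size=100):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

rng = np.random.default_rng(0)
# Placeholder data. Columns: A-phase current, B-phase current,
# electromagnetic torque, negative sequence current.
samples = rng.standard_normal((5000, 4))

train, test = split_train_test(samples, rng=rng)
batches = list(iterate_batches(train))
print(train.shape, test.shape, len(batches))
```

With 5000 samples, the 3:1 ratio gives 3750 training and 1250 test samples, so one pass over the training set takes 37 full batches of 100 plus a final batch of 50.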
This paper studies the influence of the number of iterations on the feature learning ability of the encoder. The dimension of the training set is reduced through the encoder to extract the final required two-dimensional hidden layer features (the features include mean and variance, and the two-dimensional mean is used as the standard in the experiment) to show the good sample reconstruction characteristics of the hidden variables. The distribution of the encoder data features mapped to the two-dimensional latent space was analyzed experimentally after dimension reduction. In Figures 8 and 9, the red dots represent the mean of the hidden layer characteristics of the negative sequence current in the state of inter-turn short circuit, and the black dots represent the mean of the hidden layer characteristics of the electromagnetic torque in the state of inter-turn short circuit. When the VAE model is iterated 2000 times (Figure 8), the data shows no obvious signs of convergence, there is no obvious concentration area or trend, and the classification effect is poor. The model with 5000 iterations is shown in Figure 9. Compared with the data in Figure 8, it has better convergence and classification effects and a more obvious characteristic boundary.

Effectiveness Analysis of VAE Classification
In Figures 10 and 11, the red dots represent the hidden layer characteristic variance of the negative sequence current in the inter-turn short circuit state, and the black dots represent the hidden layer characteristic variance of the electromagnetic torque in the inter-turn short circuit state. Figure 10 shows a very obvious classification effect, but the data is less centralized and the features are more scattered. As the number of iterations in Figure 11 increases, the aggregation and convergence of the data are also improved to a certain extent, and the classification effect is also obvious.
The comparison of Figures 8-11 shows that the distribution rules of the two features are consistent. The experiment used the two features of mean and variance to characterize the data expansion ability and classification ability of VAE. The variational auto-encoding network was iterated 2000 and 5000 times, respectively. The comparison between Figures 9 and 11 shows that the greater the number of iterations, the more mature the VAE model training is, and the better the data expansion and classification capabilities are. Experiments show that the VAE model has a strong feature classification ability, but the data expansion ability is unstable and the effect is not obvious.

Time Domain Correlation Analysis of Generated Data
Figure 12 shows, in the same coordinate system, the time domain waveforms of the real A phase current fault data and of the data expanded by three methods. The data expanded by GAN deviates considerably near 4.45 s. The data generated by the polynomial fitting method describes the overall trend of the curve but lacks local data features. The data expanded by WGAN fits the original data better in both trend and magnitude. Figure 13 shows the CORREL function values of the three data enhancement methods, which indicate the correlation between the expanded data and the original data. The polynomial fitting method has the lowest correlation after expansion, while the data expanded by WGAN has the highest correlation and best expresses the characteristics of the original data.
Figure 12. A phase current expansion data comparison.
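The correlation measure used in this comparison can be illustrated with a short sketch. It assumes that the CORREL function refers to the Pearson correlation coefficient, as in common spreadsheet software; the function name `correl` and all sample values below are hypothetical and are not the measurements behind Figures 13, 15, 17, and 19.

```python
import math


def correl(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (unnormalised variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples: an "original" signal and two candidate expansions.
original = [0.0, 1.0, 2.0, 3.0, 4.0]
close_fit = [0.1, 0.9, 2.1, 2.9, 4.1]   # follows the original closely
poor_fit = [2.0, 2.0, 2.5, 1.5, 2.0]    # misses the trend

print(correl(original, close_fit))
print(correl(original, poor_fit))
```

A value close to 1 means that the expanded data rises and falls together with the original data. Note that this coefficient is insensitive to a constant offset or scale factor, so it complements rather than replaces the visual comparison of the waveforms.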
Figure 14 shows the original data and the three types of expansion data for the B phase current. The data expanded by GAN loses the characteristics of the original data from 4.4 to 4.48 s. Because the current drops sharply near 4.42 s, the polynomial fit cannot express the characteristics of the original data well, whereas the data expanded by WGAN still fits the original data closely. Figure 15 shows the CORREL function values of the three expansion methods for the B phase current. WGAN gives the best fit, with a CORREL function value of 0.986, followed by GAN and then polynomial fitting.
Figure 15. Data correlation comparison of expansion methods.
Figure 16 shows the original electromagnetic torque data and the three types of expansion data. The data expanded by GAN jumps after 4.55 s and deviates from the original data. A large amount of the polynomial-fitted data around 4.5 s cannot fit the original data well. The data generated by the WGAN expansion method fits the original data well as a whole, with no local deviation. In addition, Figure 17 shows that the CORREL function value between the data generated by WGAN and the original data is 0.9956, which indicates that the data generated by this method has a higher correlation with the original data.
Figure 18 shows the time domain diagram of the negative sequence current. The data expanded by GAN deviates around 4.45, 4.55, and 4.57 s and is unstable. The polynomial fit completely loses the characteristics of the original data from 4.50 to 4.57 s, so the generated data loses credibility. The data expanded by WGAN fits the original data very closely. Figure 19 shows that the CORREL function value of the data generated by the WGAN expansion method is as high as 0.99939. Therefore, compared with GAN and the polynomial fitting method, WGAN generates data that fits the original data better.
As shown in Table 1, experimental verification indicates that the data samples generated by both GAN and WGAN conform to the characteristics of real samples. However, what the GAN generation model actually learns from the real samples is one-sided and cannot represent the whole sample. GAN has a better expansion effect on core data, and its expanded data is distributed over only part of the real data; this gives it the advantage of concentration, but the data is not representative. The samples of WGAN not only reflect the overall variation but also completely cover the real samples. This demonstrates the effectiveness of data expansion with the WGAN model, which provides an effective solution to missing and imbalanced motor fault data.

Validity Analysis of Data Generated by VAE-WGAN
Figures 20 and 21 show the influence of the number of iterations on the feature learning ability of the VAE-GAN: the greater the number of iterations, the more obvious the feature classification. The two features of mean and variance can describe the data distribution; this section presents the mean. The black dots represent the mean of the hidden-layer features of the electromagnetic torque, and the red dots represent the mean of the hidden-layer features of the negative sequence current. First, observe the distribution of the features of each individual color: the more concentrated the distribution, the more concentrated the generated data, which indicates stronger convergence and generation ability of the generative model.
Compared with Figure 22, the classification effect of Figure 23 is obviously optimized. Comparing Figures 21 and 23, the data generated by WGAN converges better; that is, the WGAN model has better convergence ability. It can therefore be concluded that the generation ability of WGAN is stronger than that of GAN, which also verifies the better generation and classification ability of the VAE-WGAN model.
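The visual criteria used here (tight clusters of one color, clear separation between colors) can also be expressed numerically. The sketch below is one possible way to do so and is not a metric taken from this paper: `mean_spread` and `separation_ratio` are hypothetical helper names, each point is assumed to be a two-dimensional feature as plotted in the figures, and the coordinates are made up for illustration.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_spread(points):
    """Average Euclidean distance of the points from their centroid.

    Smaller values mean a more concentrated cluster.
    """
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def separation_ratio(class_a, class_b):
    """Distance between class centroids divided by the average spread.

    Larger values mean the two classes are easier to tell apart.
    """
    gap = math.dist(centroid(class_a), centroid(class_b))
    spread = (mean_spread(class_a) + mean_spread(class_b)) / 2
    return math.inf if spread == 0 else gap / spread


# Hypothetical feature points for two signals (not data from the paper).
torque = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
current = [(10.0, 10.0), (12.0, 10.0), (10.0, 12.0), (12.0, 12.0)]

print(mean_spread(torque))
print(separation_ratio(torque, current))
```

Under this reading, a model with better convergence would show a smaller `mean_spread` for each color, and a model with better classification would show a larger `separation_ratio` between the two colors.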

Comparison of Feature Classification Models
To compare the models accurately, the accuracy of sample feature classification is taken as the evaluation criterion, and VAE-WGAN is compared with three other feature classification models, SVM-GAN, SVM-WGAN, and VAE-GAN, to identify the optimal scheme. Accuracy is the percentage of correct predictions in the total sample, that is, the ratio of the sum of the four correct classifications in this article to the total sample. The test results are shown in Table 2. Based on the same motor sample data, the experiment was run for 2000 and 5000 iterations to improve the accuracy of sample feature classification. The table shows that the feature classification ability of VAE is better than that of SVM, that the data generation ability of WGAN is better than that of GAN, and that classification accuracy is highest after expansion with VAE-WGAN, which verifies the conclusion of Section 4.4.
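The accuracy criterion described above can be written as a short sketch. The class labels `class_1` to `class_4` and the predictions are hypothetical placeholders for the four classes mentioned in the text; they are not the values reported in Table 2.

```python
from collections import Counter


def correct_per_class(true_labels, predicted_labels):
    """Count the correctly classified samples for each class."""
    return Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)


def accuracy(true_labels, predicted_labels):
    """Sum of correct classifications over all classes / total samples."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(correct_per_class(true_labels, predicted_labels).values())
    return correct / len(true_labels)


# Hypothetical labels for eight samples drawn from four classes.
truth = ["class_1", "class_1", "class_2", "class_2",
         "class_3", "class_3", "class_4", "class_4"]
predicted = ["class_1", "class_2", "class_2", "class_2",
             "class_3", "class_3", "class_4", "class_1"]

print(accuracy(truth, predicted))  # 0.75
```

In this example six of the eight samples are classified correctly (one, two, two, and one for the four classes), giving an accuracy of 0.75.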

Conclusions
The data collected by online monitoring equipment of permanent magnet synchronous motors are non-stationary, non-linear, multi-source heterogeneous, low in value density, and imbalanced across fault classes, which makes fault mechanism analysis difficult. To address this, this paper proposed a fault feature extraction method based on VAE-WGAN for a permanent magnet synchronous motor. Firstly, VAE-WGAN was selected as the fault feature extraction model and its network parameters were set. Then, the two-dimensional data features composed of mean and variance were used to fit the original data, so as to expand the data samples. Finally, these two eigenvalues were used to measure the classification effect of the improved model.
Technically, this paper combined VAE and GAN, shared the decoder and generator, and used the Wasserstein distance as the loss function, which avoids the vanishing gradient problem. In terms of experimental analysis, this paper compared the polynomial fitting method, the original GAN, and WGAN, and used the CORREL function value to measure the correlation between the original data and the generated data in order to verify the effectiveness of WGAN expansion. The negative sequence current and electromagnetic torque were selected to extract the two-dimensional feature mean, and the visual analysis was carried out in the two-dimensional coordinate system. With 2000 and 5000 iterations and a comparison against several models, the experiments verified the effective classification of VAE, the better generation and classification ability of VAE-WGAN, and its highest classification accuracy. The experimental results showed that the VAE-WGAN studied in this paper has good fault feature extraction effects.
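The role of the Wasserstein distance in the loss can be sketched as follows. This is a minimal illustration of the standard WGAN objective with weight clipping, assumed here because this section does not give the exact loss terms, weights, or network of the proposed model; the function names, the clipping limit of 0.01, and the critic scores are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def critic_loss(scores_real, scores_fake):
    """Loss minimised by the critic: E[D(fake)] - E[D(real)].

    Minimising it widens the score gap between real and generated
    samples, which estimates the Wasserstein distance between them.
    """
    return mean(scores_fake) - mean(scores_real)


def generator_loss(scores_fake):
    """Loss minimised by the generator (here, the shared decoder)."""
    return -mean(scores_fake)


def clip_weights(weights, limit=0.01):
    """Clip critic weights to [-limit, limit] (Lipschitz constraint)."""
    return [max(-limit, min(limit, w)) for w in weights]


# Hypothetical critic scores for a batch of real and generated samples.
real_scores = [1.0, 3.0]
fake_scores = [0.0, 1.0]

print(critic_loss(real_scores, fake_scores))   # -1.5
print(generator_loss(fake_scores))             # -0.5
print(clip_weights([-0.5, 0.005, 0.2]))        # [-0.01, 0.005, 0.01]
```

Because these losses are linear in the critic scores rather than passed through a saturating logarithm, the generator continues to receive a usable gradient even when the critic separates real and generated samples well, which is the usual explanation for why this formulation avoids vanishing gradients.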

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.