1. Introduction
In industrial production, the operating condition of electric motors directly affects the efficiency of critical equipment [1]. The rise of deep learning has spawned a large body of methods for fault diagnosis and prognosis, yielding substantial progress in industrial applications that require cross-domain generalization and unknown-class fault diagnosis [2,3]. However, as motor hardware continues to improve, collecting realistic and fully annotated fault data has become increasingly difficult [4]; training such models typically demands large quantities of reliable normal and faulty samples to secure strong generalization. To address data scarcity and distribution shift, data augmentation, synthetic-data generative modeling, and self- or semi-supervised learning are often coupled with simulation-to-real co-design and consistency constraints, thereby enhancing the effectiveness of model training [5].
In machine learning and the broader computing industry, data augmentation has traditionally relied on manual simulation and synthesis, often through noise injection [6], interpolation-based synthesis [7], or fuzzy processing [8]. With the advent of deep learning, augmentation methodologies have become more automated and sophisticated, incorporating reinforcement-learning-driven policies [9], analog synthesis [10], and mixed-sample strategies [11] to enhance effectiveness and generalization. Nevertheless, given the rapid evolution of neural architectures, many of these approaches remain comparatively simple and may lack diversity and realism, thereby limiting their applicability to more complex models.
The advent of Generative Adversarial Networks (GANs) [12] and Variational Autoencoders (VAEs) [13] has opened new avenues for data augmentation. GANs leverage adversarial training to synthesize realistic samples, thereby expanding training sets in data-scarce scenarios, while VAEs facilitate small-sample learning through self-supervised objectives and unsupervised pretraining. Building on these foundations, numerous derivatives have emerged, including DCGAN [14,15] with deep convolutional/transpose-convolutional architectures, WGAN [16] based on the Wasserstein distance, conditional variants such as C-GAN [17] and CVAE [18], CycleGAN [19] employing cycle-consistency constraints, and Factor-VAE [20] promoting disentangled latent factors. Collectively, these semi-supervised and unsupervised methods reduce reliance on manual labeling and learn high-dimensional representations by modeling the underlying data distribution. The MDGCML framework, a multi-source domain gradient-coordination approach, enhances robustness to noise and unknown factors in few-sample regimes, thereby enabling more effective learning and training [21]. Together, these advances have substantially propelled the development of data augmentation methodologies [22].
However, when confronted with long and variable-length time series, conventional GAN-based models exhibit reduced efficiency: they prioritize distributional alignment while often overlooking temporal dependencies and long-range dynamics. As a result, their performance on time-series generation is frequently suboptimal [23].
TimeGAN [24] is a generative framework for time series that couples a sequence encoder with an autoregressive RNN within a multi-stage GAN architecture. It captures both distributional characteristics and temporal dynamics through an unsupervised and self-supervised co-training scheme that learns a shared latent space; operating in this hidden space enhances interpretability and improves the similarity and predictability of the synthesized data [25,26]. In industrial motors, operational vibration and acoustic noise are ubiquitous, and faults introduce periodic anomalies and impulsive disturbances. The prevalence of outliers and the long temporal span complicate TimeGAN's learning of motor-state vibration signals, often yielding sequences that diverge from the true distribution over extended horizons. To mitigate long-term dependency issues, the RNN component in TimeGAN can be replaced with LSTM [27,28] or Bi-LSTM [29,30]; 1D CNNs also retain strong temporal feature-extraction capability, capturing local and multi-scale patterns along the time axis with high parameter efficiency [31]. For multi-source signals, an OOBN-based framework performs decision-level fusion via Bayesian networks and evidence weighting, thereby improving inference reliability; in addition, residual-analysis and excitation-factor modules enhance robustness to noise and operational uncertainties [32,33]. Incorporating self-attention mechanisms [34] or Transformer modules [35] into TimeGAN strengthens the modeling of latent representations and temporal dependencies, further improving generative performance.
Despite demonstrated improvements in data quality, several challenges remain for augmenting class-specific motor-state data: (1) for periodic signals (vibration and current), many generators underemphasize temporal dependencies and their alignment with the underlying process, leading to ineffective cycle-level learning; (2) models trained purely in the time domain or under weak conditioning exhibit limited robustness to the intrinsic variability and noise of the raw signals; and (3) for multivariate time series, generators tend to match overall statistical properties while neglecting informative frequency-domain structure.
In this paper, we present Transformer-TimeGAN (TFT-TimeGAN), a generative framework that leverages time-domain envelope signals and instantaneous phase variations to strengthen temporal representation learning. Within the generator, a cross-attention mechanism fuses time- and frequency-domain representations, while a learnable dynamic-weighting module adaptively balances their contributions, thereby improving synthesis fidelity. The contributions of this study are summarized as follows:
- (1) In the time-domain generator, a Transformer architecture integrates features from the raw waveform, the envelope, and instantaneous phase variations. A time-step (positional) embedding reinforces temporal dependencies, thereby enhancing the model's capacity to learn informative time-domain representations.
- (2) The generator is split into a time-domain branch and a frequency-domain branch; each is learned separately, and their outputs are dynamically weighted so that the overall generator focuses on the fused features.
- (3) Label-conditioning constraints: discrete labels are embedded into a continuous conditional space and, after linear projection, concatenated with the input signal. Layer-wise conditioning in the latent space steers features toward the specified condition, ensuring that the generated outputs remain well-aligned with the original signal.
- (4) A delay loss is established to emphasize the continuity and correlation between each time step and subsequent time steps; this smoothed delay loss is combined with the discriminator loss as one of the criteria for judging the quality of the generator's output.
The paper is organized as follows: Section 2 describes the generative model constructed in this paper; Section 3 validates the differences between the generated data and the original data using a publicly available dataset [36] as well as a self-constructed dataset; Section 4 provides the conclusions.
2. Materials and Methods
2.1. Theory
TimeGAN consists of an embedding layer, a recovery (decoder) layer, a supervision layer, and a generator, as shown in Figure 1. Unlike conventional GANs, TimeGAN introduces an explicit encoding component, a joint training scheme, and a tailored loss that couple adversarial and supervised objectives to learn temporally coherent latent representations. The goal is to learn a model from the original dataset $\mathcal{D} = \{(\mathbf{s}_n, \mathbf{x}_{n,1:T_n})\}_{n=1}^{N}$, such that the joint density over the static feature vector $\mathbf{s}$ and the temporal feature sequence $\mathbf{x}_{1:T}$, denoted $\hat{p}(\mathbf{s}, \mathbf{x}_{1:T})$, approximates the true distribution $p(\mathbf{s}, \mathbf{x}_{1:T})$ in (1). Using an autoregressive factorization, we further define per-time-step targets with density $\hat{p}(\mathbf{x}_t \mid \mathbf{s}, \mathbf{x}_{1:t-1})$ that approximate $p(\mathbf{x}_t \mid \mathbf{s}, \mathbf{x}_{1:t-1})$ for each $t$ in (2).

$$\min_{\hat{p}} D\big(p(\mathbf{s}, \mathbf{x}_{1:T}) \,\|\, \hat{p}(\mathbf{s}, \mathbf{x}_{1:T})\big) \quad (1)$$

$$\min_{\hat{p}} D\big(p(\mathbf{x}_t \mid \mathbf{s}, \mathbf{x}_{1:t-1}) \,\|\, \hat{p}(\mathbf{x}_t \mid \mathbf{s}, \mathbf{x}_{1:t-1})\big) \quad (2)$$

Here $p$ denotes a probability distribution, $D(\cdot \,\|\, \cdot)$ a suitable divergence between distributions, and $T$ is the sequence length used for training.
Under an idealized GAN formulation, the adversarial distribution-matching objective reduces to the Jensen–Shannon divergence, whereas the supervised component corresponds to the Kullback–Leibler (KL) divergence, i.e., maximum-likelihood estimation.
TimeGAN has three main types of joint losses:
2.1.1. Reconstruction Loss (Embedding and Recovery)
The embedding and recovery layers establish a mapping between observed features and a latent space, enabling the adversarial module to capture fundamental temporal dynamics through low-dimensional representations. The embedding function $e = (e_{\mathcal{S}}, e_{\mathcal{X}})$ maps the static feature vector $\mathbf{s}$ and the dynamic sequence $\mathbf{x}_{1:T}$ to their latent counterparts $\mathbf{h}_{\mathcal{S}}$ and $\mathbf{h}_{1:T}$, respectively:

$$\mathbf{h}_{\mathcal{S}} = e_{\mathcal{S}}(\mathbf{s}), \qquad \mathbf{h}_t = e_{\mathcal{X}}(\mathbf{h}_{\mathcal{S}}, \mathbf{h}_{t-1}, \mathbf{x}_t) \quad (3)$$

Here $e_{\mathcal{S}}$ denotes the static encoder that maps the static features to the latent static space $\mathcal{H}_{\mathcal{S}}$, with $\mathbf{h}_{\mathcal{S}}$ the resulting latent vector; $e_{\mathcal{X}}$ denotes the temporal encoder, applied recursively along the sequence, which maps $(\mathbf{h}_{\mathcal{S}}, \mathbf{h}_{t-1}, \mathbf{x}_t)$ to the latent temporal state $\mathbf{h}_t$.

The recovery function $r = (r_{\mathcal{S}}, r_{\mathcal{X}})$ maps latent representations back to the original spaces $\mathcal{S}$ and $\mathcal{X}$ to produce reconstructions:

$$\tilde{\mathbf{s}} = r_{\mathcal{S}}(\mathbf{h}_{\mathcal{S}}), \qquad \tilde{\mathbf{x}}_t = r_{\mathcal{X}}(\mathbf{h}_t) \quad (4)$$

where $r_{\mathcal{S}}$ and $r_{\mathcal{X}}$ denote the decoders that recover the embedded latent static and temporal representations to the original spaces $\mathcal{S}$ and $\mathcal{X}$, respectively.

The reconstruction loss $\mathcal{L}_R$ assesses whether the embedding-recovery pipeline accurately maps the latent variables $\mathbf{h}_{\mathcal{S}}, \mathbf{h}_{1:T}$ to reconstructions $\tilde{\mathbf{s}}, \tilde{\mathbf{x}}_{1:T}$ that match the original data $\mathbf{s}, \mathbf{x}_{1:T}$:

$$\mathcal{L}_R = \mathbb{E}_{\mathbf{s}, \mathbf{x}_{1:T} \sim p}\Big[\lVert \mathbf{s} - \tilde{\mathbf{s}} \rVert_2 + \sum_t \lVert \mathbf{x}_t - \tilde{\mathbf{x}}_t \rVert_2\Big] \quad (5)$$
2.1.2. Unsupervised Loss (Generator and Discriminator)
The sequence generator operates on static and dynamic inputs sampled from prior noise spaces $\mathcal{Z}_{\mathcal{S}}$ and $\mathcal{Z}_{\mathcal{X}}$, respectively. Random vectors $\mathbf{z}_{\mathcal{S}}$ and $\mathbf{z}_t$ are drawn and transformed into the generative latent codes $\hat{\mathbf{h}}_{\mathcal{S}}$ and $\hat{\mathbf{h}}_t$. The generator function $g = (g_{\mathcal{S}}, g_{\mathcal{X}})$ maps the static-dynamic tuple of noise vectors to latent codes:

$$\hat{\mathbf{h}}_{\mathcal{S}} = g_{\mathcal{S}}(\mathbf{z}_{\mathcal{S}}), \qquad \hat{\mathbf{h}}_t = g_{\mathcal{X}}(\hat{\mathbf{h}}_{\mathcal{S}}, \hat{\mathbf{h}}_{t-1}, \mathbf{z}_t) \quad (6)$$

Here $g_{\mathcal{S}}$ denotes the static generator that maps the static noise space to the latent space $\mathcal{H}_{\mathcal{S}}$, and $g_{\mathcal{X}}$ denotes the temporal generator, applied recursively over time, which maps $(\hat{\mathbf{h}}_{\mathcal{S}}, \hat{\mathbf{h}}_{t-1}, \mathbf{z}_t)$ to the latent state $\hat{\mathbf{h}}_t$.

The discriminator, analogous to the generator, operates in the embedding space. The mapping $d = (d_{\mathcal{S}}, d_{\mathcal{X}})$ takes static and temporal latent codes as input and returns discrimination scores:

$$\tilde{y}_{\mathcal{S}} = d_{\mathcal{S}}(\tilde{\mathbf{h}}_{\mathcal{S}}), \qquad \tilde{y}_t = d_{\mathcal{X}}(\overrightarrow{\mathbf{u}}_t, \overleftarrow{\mathbf{u}}_t) \quad (7)$$

Here $\tilde{\mathbf{h}}_{\mathcal{S}}, \tilde{\mathbf{h}}_t$ denote latent embeddings that may correspond to encoded real data ($\mathbf{h}_{\mathcal{S}}, \mathbf{h}_t$) or generated data ($\hat{\mathbf{h}}_{\mathcal{S}}, \hat{\mathbf{h}}_t$), and the outputs $\tilde{y}_{\mathcal{S}}, \tilde{y}_t$ represent class-posterior scores. $d_{\mathcal{S}}$ and $d_{\mathcal{X}}$ are the static and temporal discriminative components, respectively; $\overrightarrow{\mathbf{u}}_t = \overrightarrow{c}(\tilde{\mathbf{h}}_{\mathcal{S}}, \tilde{\mathbf{h}}_t, \overrightarrow{\mathbf{u}}_{t-1})$ and $\overleftarrow{\mathbf{u}}_t = \overleftarrow{c}(\tilde{\mathbf{h}}_{\mathcal{S}}, \tilde{\mathbf{h}}_t, \overleftarrow{\mathbf{u}}_{t+1})$ denote the forward and backward hidden-state sequences, with $\overrightarrow{c}$ and $\overleftarrow{c}$ the corresponding recurrent transition functions.

The unsupervised loss $\mathcal{L}_U$ reflects the adversarial relationship between the generator and the discriminator: the discriminator is trained to maximize the likelihood of correctly classifying real training embeddings $\mathbf{h}_{\mathcal{S}}, \mathbf{h}_{1:T}$ versus generated sequences $\hat{\mathbf{h}}_{\mathcal{S}}, \hat{\mathbf{h}}_{1:T}$, whereas the generator is trained to minimize this objective and thereby fool the discriminator:

$$\mathcal{L}_U = \mathbb{E}_{\mathbf{s}, \mathbf{x}_{1:T} \sim p}\Big[\log y_{\mathcal{S}} + \sum_t \log y_t\Big] + \mathbb{E}_{\mathbf{s}, \mathbf{x}_{1:T} \sim \hat{p}}\Big[\log(1 - \hat{y}_{\mathcal{S}}) + \sum_t \log(1 - \hat{y}_t)\Big] \quad (8)$$
2.1.3. Supervised Loss
The supervised term $\mathcal{L}_S$ encourages the generator to match per-time-step conditional distributions. It contrasts the encoder-induced real conditional $p(\mathbf{h}_t \mid \mathbf{h}_{\mathcal{S}}, \mathbf{h}_{1:t-1})$ with the generator-induced conditional $\hat{p}(\mathbf{h}_t \mid \mathbf{h}_{\mathcal{S}}, \mathbf{h}_{1:t-1})$ using a maximum-likelihood (negative log-likelihood) criterion:

$$\mathcal{L}_S = \mathbb{E}_{\mathbf{s}, \mathbf{x}_{1:T} \sim p}\Big[\sum_t \lVert \mathbf{h}_t - g_{\mathcal{X}}(\mathbf{h}_{\mathcal{S}}, \mathbf{h}_{t-1}, \mathbf{z}_t) \rVert_2\Big] \quad (9)$$

where $\mathbb{E}$ denotes the expected value under the indicated distribution.
The learning procedure incorporates two additional objectives. The embedding and recovery modules are optimized to preserve temporal relevance while compressing the representation, by minimizing a weighted combination of the supervised and reconstruction terms:

$$\min_{\theta_e, \theta_r} \big(\lambda \mathcal{L}_S + \mathcal{L}_R\big) \quad (10)$$

The generator is trained, in adversarial interplay with the discriminator, to maintain classification accuracy while reducing the supervised loss, via

$$\min_{\theta_g} \big(\eta \mathcal{L}_S + \max_{\theta_d} \mathcal{L}_U\big) \quad (11)$$

where $\theta_e$, $\theta_r$, $\theta_g$, and $\theta_d$ denote the parameters of the embedding, recovery, generator, and discriminator networks, respectively; $\lambda$ and $\eta$ are balancing hyperparameters, set to 1 and 10 in this work.
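To make the interplay of these objectives concrete, the following is a minimal PyTorch sketch of the three loss terms; the module names (`embedder`, `recovery`, `generator`, `supervisor`, `discriminator`), the tensor shapes, and the use of MSE/BCE surrogates are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def timegan_losses(x, z, embedder, recovery, generator, supervisor, discriminator):
    """Sketch of TimeGAN's joint losses over (B, T, F) sequence tensors."""
    # Reconstruction loss L_R (Eq. (5)): embed real data, then recover it.
    h = embedder(x)
    x_tilde = recovery(h)
    loss_r = F.mse_loss(x_tilde, x)

    # Unsupervised loss L_U (Eq. (8)): discriminator scores real vs. synthetic codes.
    h_hat = generator(z)
    y_real, y_fake = discriminator(h), discriminator(h_hat)
    loss_u = (F.binary_cross_entropy_with_logits(y_real, torch.ones_like(y_real))
              + F.binary_cross_entropy_with_logits(y_fake, torch.zeros_like(y_fake)))
    # Generator's adversarial term: drive synthetic codes toward the "real" label.
    loss_g_adv = F.binary_cross_entropy_with_logits(y_fake, torch.ones_like(y_fake))

    # Supervised loss L_S (Eq. (9)): one-step-ahead prediction in latent space.
    loss_s = F.mse_loss(supervisor(h)[:, :-1], h[:, 1:])

    lam, eta = 1.0, 10.0                        # weights given in the text
    loss_embedding = lam * loss_s + loss_r      # embedder/recovery objective (10)
    loss_generator = eta * loss_s + loss_g_adv  # generator side of objective (11)
    return loss_embedding, loss_generator, loss_u
```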
2.2. Proposed Method
Figure 2 illustrates Transformer–TimeGAN, which integrates five components: an embedding layer, a recovery layer, a supervision layer, a generator, and a discriminator. In the time domain, the Hilbert transform is applied to the raw signals to obtain the envelope and instantaneous phase, which serve as auxiliary cues. Labeled vibration and current signals constitute the primary inputs, while their envelope and phase features are treated as secondary inputs. These streams undergo independent feature extraction and are subsequently fused through a Transformer layer. In parallel, frequency-domain descriptors are derived via the Fourier transform to construct a frequency-domain generator. The time- and frequency-domain generators are then fused, with their contributions dynamically regulated by learnable weights. Along the generation pathway, the data are processed by linear projections, residual blocks, normalization, and position-wise feed-forward networks. The synthesized outputs are finally validated using a comprehensive suite of performance metrics.
Within the framework, the embedding, recovery, supervision, and discriminator modules adopt the same Transformer-based backbone as the time-domain branch. The generator itself consists of two coordinated submodules: a time-domain generator and a frequency-domain generator.
2.3. Data Calculation and Extraction
Motor-operating-status signals (vibration and current) are acquired by heterogeneous sensors distributed across the equipment. These channels exhibit homogeneity and cross-channel correlation, and their mutual influence intensifies as the channel count increases. Consequently, channel-aware guidance is required so that each channel is generated in a direction consistent with its real counterpart, enabling the model to better capture the underlying data distribution, latent structure, and temporal dependencies. To this end, the framework extracts the envelope $a(t)$ and instantaneous phase $\phi(t)$ from the raw signal via the Hilbert transform (12) and phase analysis (13). These features serve as auxiliary inputs that help the generator more effectively recover the intrinsic characteristics of the original signals.

$$z(t) = x(t) + j\,\mathcal{H}[x(t)] \quad (12)$$

$$a(t) = \lvert z(t) \rvert, \qquad \phi(t) = \arg z(t) \quad (13)$$

where $x(t)$ denotes the original signal and $z(t)$ is the complex analytic signal obtained via the Hilbert transform $\mathcal{H}[\cdot]$. The envelope and phase are given by $a(t)$ and $\phi(t)$, respectively. The instantaneous phase change is computed by discrete differencing with boundary handling, as seen in (14):

$$\Delta\phi_t = \phi_{t+1} - \phi_t \quad (14)$$

where $\Delta\phi_t$ denotes the phase difference. The three input streams (raw signal, envelope, and $\Delta\phi$) are projected to a common feature dimensionality via linear layers. Optional label conditioning is realized by mapping discrete labels to a continuous conditional embedding. After temporal reconstruction to match the sequence length, this embedding is integrated with the original signal, and a subsequent linear projection restores the original signal dimensionality.
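For reference, a minimal NumPy/SciPy sketch of this feature extraction follows; the function name `hilbert_features`, the (T, C) array layout, and the boundary-repeat choice for the first time step are assumptions for illustration, not the paper's exact preprocessing code.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_features(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Envelope and instantaneous-phase change of a (T, C) signal, per Eqs. (12)-(14)."""
    z = hilbert(x, axis=0)                    # analytic signal z(t) = x(t) + j*H[x(t)]
    envelope = np.abs(z)                      # a(t) = |z(t)|
    phase = np.unwrap(np.angle(z), axis=0)    # phi(t), unwrapped to avoid 2*pi jumps
    dphi = np.diff(phase, axis=0)             # discrete phase difference
    dphi = np.concatenate([dphi[:1], dphi], axis=0)  # assumed boundary handling: repeat first step
    return envelope, dphi
```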
2.4. Generator Model
The condition-augmented raw signal is first passed through a positional-encoding layer; after injecting time-step information, it is fed to the encoder for feature extraction. The Hilbert-derived envelope and instantaneous-phase features are then supplied to a cross-attention block together with the encoded signal, guiding the hierarchy to attend to salient latent characteristics of the original waveform.
Signal synthesis is realized through two coordinated branches: a time-domain branch and a frequency-domain branch. The frequency-domain representation is obtained by applying a fast Fourier transform (FFT) to the time-domain signal and is then projected via a linear layer to match the time-domain dimensionality. The two branches are fused using scaled dot-product attention, and the fused representation is propagated to the subsequent layer.
Here $b$ indexes samples within a mini-batch, $k$ denotes the frequency bin of a sample, and $c$ indexes the feature (channel) dimension. The complex-valued quantity $X_{b,k,c}$ has real and imaginary parts $\mathrm{Re}(X_{b,k,c})$ and $\mathrm{Im}(X_{b,k,c})$, respectively.
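A minimal PyTorch sketch of how such frequency-domain inputs can be constructed is shown below; the module name `FreqFeatures` and the choice to concatenate real and imaginary parts before the linear projection are assumptions, not necessarily the paper's exact design.

```python
import torch
import torch.nn as nn

class FreqFeatures(nn.Module):
    """Project FFT real/imaginary parts to the time-domain feature width."""
    def __init__(self, n_channels: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2 * n_channels, d_model)  # concat(Re, Im) -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) time-domain signal
        spec = torch.fft.rfft(x, dim=1)                     # (B, T//2 + 1, C), complex
        feats = torch.cat([spec.real, spec.imag], dim=-1)   # (B, K, 2C)
        return self.proj(feats)                             # (B, K, d_model)
```

Because cross-attention allows the key/value sequence length to differ from the query length, the frequency-bin axis need not be resampled to the time length before fusion.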
The fusion computation is illustrated in Figure 3 and Figure 4. In the time-domain module, the vibration and current signals serve as the queries $Q$ for the cross-attention block. The Hilbert-derived envelope $a$ and the instantaneous-phase feature $\Delta\phi$ are concatenated along the feature dimension to form the key-value input.

In the first attention block, the vibration and current signals are linearly projected to form the queries $Q$. The envelope and instantaneous-phase features are concatenated along the feature axis, projected to a common dimensionality, and used as keys $K$ and values $V$. Similarities are computed via the scaled dot-product $QK^{\top}/\sqrt{d_k}$ and normalized with a softmax to obtain attention weights, which are then applied to $V$ to yield the fused representation. Across the three heads, head-wise weights are adaptively assigned before aggregation. The block output is finalized with a residual connection followed by layer normalization.
Linear projections are computed according to Equation (17), and attention weights follow the scaled dot-product formulation in Equation (18):

$$Q = XW_Q, \qquad K = YW_K, \qquad V = YW_V \quad (17)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\Big(\frac{QK^{\top}}{\sqrt{d_k}}\Big)V \quad (18)$$

where $X$ denotes the query-side input, $Y$ the key/value-side input, $W_Q$, $W_K$, $W_V$ are learnable projection matrices, and $d_k$ is the key dimensionality. In the frequency-domain branch, the output of the time-domain generator serves as the queries $Q$, while the frequency-domain features provide the keys $K$ and values $V$.
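Equations (17) and (18) correspond to standard multi-head cross-attention, so the block can be sketched compactly in PyTorch; here `nn.MultiheadAttention` stands in for the paper's custom block, and the module name and internals are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Scaled dot-product cross-attention: queries from one stream, keys/values from another."""
    def __init__(self, d_model: int, n_heads: int = 3):
        super().__init__()
        # d_model must be divisible by n_heads (three heads per the text).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_stream: torch.Tensor, kv_stream: torch.Tensor) -> torch.Tensor:
        # query_stream: (B, T, d_model), e.g. the encoded raw vibration/current signal
        # kv_stream:    (B, S, d_model), e.g. projected envelope/phase or FFT features
        fused, _ = self.attn(query_stream, kv_stream, kv_stream)
        return self.norm(query_stream + fused)  # residual connection + layer norm
```

In the time-domain block the encoded raw signal supplies `query_stream` and the concatenated envelope/phase features supply `kv_stream`; in the frequency branch the roles follow the assignment described above.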
2.5. Delay Loss
In time-series modeling, temporal attributes are inseparable from the underlying data distribution; for periodic sequences, preserving temporal structure is particularly important. Although TimeGAN incorporates time-step labels during training, non-stationary signals with complex, vibration-like variations often drive optimization toward global distributional matching, thereby underemphasizing fidelity at local time steps. To mitigate this effect, a delay-consistency loss $\mathcal{L}_{\text{delay}}$ is introduced to enforce alignment between generated and real sequences at corresponding time indices (within a selected time-step range) using a mean-squared-error criterion:

$$\mathcal{L}_{\text{delay}} = \frac{1}{B\,(T-\tau)\,F} \sum_{b=1}^{B} \sum_{t=1}^{T-\tau} \sum_{f=1}^{F} \Big[\big(x_{b,t+\tau,f} - x_{b,t,f}\big) - \big(\hat{x}_{b,t+\tau,f} - \hat{x}_{b,t,f}\big)\Big]^{2} \quad (19)$$

where $x$ denotes the training (real) data and $\hat{x}$ the generated data; $B$ is the batch size; $T$ is the sequence length; $F$ is the feature (channel) dimension; and $\tau$ is the delay step. A smaller value of $\mathcal{L}_{\text{delay}}$ indicates better generation within localized segments, achieved by enforcing agreement between each sample and its $\tau$-step-shifted counterpart.
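A minimal sketch of the delay-consistency term, following the increment-matching form reconstructed in (19), is given below; the function name `delay_loss` and the (B, T, F) tensor layout are assumptions.

```python
import torch

def delay_loss(real: torch.Tensor, fake: torch.Tensor, tau: int = 1) -> torch.Tensor:
    """MSE between tau-step increments of real and generated (B, T, F) sequences."""
    d_real = real[:, tau:] - real[:, :-tau]   # tau-step differences of real data
    d_fake = fake[:, tau:] - fake[:, :-tau]   # tau-step differences of generated data
    return torch.mean((d_real - d_fake) ** 2)
```

Penalizing the increments, rather than the raw values, emphasizes continuity and correlation between each time step and its successors while remaining differentiable for joint training with the adversarial losses.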
2.6. Measurement Methods
FID (Fréchet Inception Distance) compares the feature-space distributions of real and generated data by matching their first two moments, the mean vectors $\mu_r, \mu_g$ and covariance matrices $\Sigma_r, \Sigma_g$ (smaller is better).
DTW (Dynamic Time Warping) computes the minimum cumulative alignment cost between two sequences $X$ and $Y$ over all monotone warping paths (smaller is better).
Wasserstein Distance measures the minimal "transport cost" required to move probability mass from distribution $P$ to distribution $Q$ (smaller is better).
KS (Kolmogorov–Smirnov) test compares the one-dimensional distributions of real and generated samples via the maximum difference between their empirical CDFs (smaller is better).
MMD (Maximum Mean Discrepancy) measures the discrepancy between multivariate distributions in an RKHS induced by a kernel $k$; with a characteristic kernel, $\mathrm{MMD} = 0$ if and only if the two distributions are identical (smaller is better).
AUC (Area Under the ROC Curve), used in classifier two-sample tests, quantifies how well a discriminator separates real from generated data; it equals the probability that a random real sample receives a higher score than a randomly generated one (closer to 0.5 is better).
The discriminative score assesses separability between generated and real data using a binary classifier. Scores near chance level indicate that the classifier cannot reliably distinguish the two sets—hence higher fidelity—whereas larger deviations from chance suggest poorer generative quality.
The predictive score trains a forecaster on synthetic sequences and evaluates one-step predictions on real sequences; the metric is the mean absolute error (MAE) averaged over time steps, with lower values indicating that the dynamics learned from synthetic data better match those of the real data.
PCA and t-SNE visualize the distributions of real and generated samples in a low-dimensional space, while power spectral density (PSD) compares their energy distributions in the frequency domain.
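Several of these metrics can be computed with standard scientific-Python tooling. The sketch below, assuming flattened (N, D) feature matrices and a simple logistic-regression discriminator (both assumptions for illustration), shows the per-feature KS statistic, the 1-Wasserstein distance, and the classifier two-sample AUC.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def two_sample_metrics(real: np.ndarray, fake: np.ndarray) -> dict:
    """KS, 1-Wasserstein, and discriminator AUC for (N, D) real/generated features."""
    ks = np.mean([ks_2samp(real[:, j], fake[:, j]).statistic for j in range(real.shape[1])])
    wd = np.mean([wasserstein_distance(real[:, j], fake[:, j]) for j in range(real.shape[1])])

    # Classifier two-sample test: AUC near 0.5 means real and fake are indistinguishable.
    X = np.vstack([real, fake])
    y = np.r_[np.ones(len(real)), np.zeros(len(fake))]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    return {"KS": ks, "Wasserstein": wd, "AUC": auc}
```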
4. Conclusions
This study presented TFT–TimeGAN, a time–frequency generative framework that augments vibration and current signals. The method integrates envelope and instantaneous phase features to emphasize local temporal changes, fuses time- and frequency-domain representations through multi-head cross-attention with dynamic weighting, and applies delay-consistency constraints and label conditioning to strengthen temporal alignment and class control. Validation was conducted on the PU dataset with three channels and on a self-constructed dataset with six channels, together with ablation variants.
Across both datasets, the model achieved uniformly low distributional distances—FID, DTW, and 1-Wasserstein—as well as reduced KS, MMD, and discriminator AUC approaching chance (0.5), indicating close alignment with real data in both global statistics and sequence-level timing. PCA and t-SNE showed substantial overlap between synthetic and real point sets, while PSD comparisons confirmed preservation of frequency-domain energy structure. In downstream evaluation, classifiers trained with the proposed data achieved an accuracy exceeding 93%, and discriminative and predictive scores remained low, supporting both realism and task usefulness. Ablation results further verified that each component—time–frequency fusion, dynamic weighting, delay-consistency loss, and conditioning—contributed to the observed gains. Overall, TFT-TimeGAN offers a robust and generalizable pathway for high-fidelity augmentation of multichannel motor signals, improving data quality.
Compared with baseline models, TFT–TimeGAN retains a comparable forward (sampling) pathway; however, multi-feature fusion and attention introduce additional training-time computation and memory overhead and increase inference latency and computational load. The model is also sensitive to hyperparameters such as hidden dimensionality and the number of attention heads. To mitigate these issues, future work may adopt mixed-precision training (AMP) and self-supervised pretraining to reduce training epochs and memory usage, employ local or linear-time attention to compress inference cost, and incorporate adaptive mechanisms together with stabilization strategies such as EMA, spectral normalization, and early stopping, thereby improving hyperparameter robustness while maintaining training stability.