1. Introduction
Driven by the dual-carbon strategy, natural gas, an efficient and clean energy source, is widely used in industrial manufacturing and urban energy systems. As intelligent and intensive extraction technologies continue to improve the performance of natural gas wells, ensuring their safe and stable operation has become increasingly important.
Typical faults include freeze plugging at the wellhead or in the wellbore, fluid accumulation in the wellbore, and formation energy depletion. If not addressed in time, these faults can lead to capacity loss, equipment damage, unplanned shutdowns, and other serious consequences. Therefore, developing an intelligent fault-diagnosis system is essential to support safe and stable well operations.
In natural gas well fault diagnosis, traditional methods rely heavily on manual experience and fixed threshold alarms, and they struggle with the dynamic, nonlinear, and high-dimensional nature of multi-source time series wellsite data, causing delayed and false alarms. Although plunger-lift monitoring in unconventional gas development provides ample production and operational data for ML/AI-based anomaly diagnosis, field data suffer from scarce failure cases, hard-to-capture pre-fault features, and strong inter-variable correlation, all of which limit model accuracy [1]. Complex production environments and experience-dependent detection lead to reduced output, unreliable operation, and high labor costs [2]. Nine advanced ML methods have been evaluated on sensor data from four wells, identifying key features [3], yet the shortcomings of existing logging tools in measuring downhole temperature and pressure still hinder modern monitoring and accurate fault judgment [4]. Challenges remain, including scarce samples, weak-signal detection, poor model adaptability under complex conditions, and the lack of methods that integrate time series augmentation with classification models for highly dynamic, non-stationary well environments.
In recent years, data-driven diagnostic techniques have made progress in industrial health monitoring. Among them, Diffusion Probabilistic Models (DPMs) have shown strong generative capabilities for enhancing small-sample datasets. Ho et al. (2020) proposed the DDPM, which models data distributions through a forward noising process and a backward denoising process [5]. Nichol and Dhariwal (2021) later improved the model's sampling efficiency and convergence [6]. Zhao et al. (2025) introduced DDPM into mechanical fault-diagnosis tasks, enhancing classification accuracy and robustness under small-sample conditions [7]. Yang et al. (2024) proposed a class-wise diffusion strategy to mitigate category imbalance and reduce diagnostic bias [8]. Zhang et al. (2024) developed an interpretable diffusion model using latent-variable techniques for weakly supervised industrial scenarios [9]. Zhang et al. (2025) also applied DDPM to HVAC systems, demonstrating its applicability in real-world industrial environments [10].
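For reference, the sketch below illustrates the DDPM forward-noising step and the noise-prediction training loss that these works build on, written in PyTorch; the schedule length, beta range, and the `model(x_t, t)` signature are illustrative assumptions, not settings from the cited papers.

```python
import torch
import torch.nn.functional as F

T = 1000                                    # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, 0)  # cumulative product of (1 - beta_t)

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a = alphas_bar[t].sqrt().view(-1, 1, 1)          # per-sample scale for x_0
    s = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1)  # per-sample noise scale
    return a * x0 + s * noise

def ddpm_loss(model, x0):
    """Train the network to predict the injected noise eps from (x_t, t)."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    return F.mse_loss(model(q_sample(x0, t, noise), t), noise)
```

At sampling time, the learned noise predictor is applied step by step to invert this process, turning Gaussian noise into synthetic sequences.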
Meanwhile, Convolutional Neural Network (CNN) and Transformer architectures have become mainstream in deep learning-based diagnosis tasks. CNNs are effective for local feature extraction, while Transformers excel at modeling long-range temporal dependencies, and their fusion has become a trend in fault diagnosis. For instance, Zhu et al. (2023) proposed a multi-scale Transformer-CNN model for robust knowledge transfer across devices [11]. Chen et al. (2024) designed the BCTGAN framework, which integrates CNN and Transformer to handle imbalanced data [12]. Lai et al. (2025) constructed a parallel CNN-Transformer encoder to balance diagnostic accuracy and lightweight model design [13]. Zhang et al. (2023) proposed TSViT, showing the Transformer's ability to capture complex temporal fault patterns [14]. In addition, Lu et al. (2022) and Pei et al. (2023) introduced mechanisms to improve feature representation and adaptability in complex industrial tasks [15,16].
Although diffusion models and deep fusion networks have shown great potential, building an end-to-end diagnostic system for natural gas wells remains challenging. These wells operate in highly dynamic and non-stationary environments, often under small-sample and weak-signal conditions. Moreover, current research lacks integrated methods that combine time series data augmentation with fault-classification models for complex field conditions.
To address the dual challenges of sample scarcity and feature complexity in natural gas well diagnosis, this paper proposes an intelligent fault-diagnosis method that integrates diffusion-based data augmentation with a CNN-Transformer classification model. The key contributions are as follows:
A time series diffusion model is introduced to generate high-quality and diverse synthetic fault samples, thereby alleviating the small-sample problem in real-world scenarios;
A CNN-Transformer-based diagnostic model is constructed to perform multi-class fault identification on enhanced data, capturing both local details and global temporal patterns;
The proposed method is validated on a real-world dataset from natural gas wells. Results show improved classification accuracy, robustness under weak-signal conditions, and better adaptability to complex working environments.
The remainder of this paper is organized as follows:
Section 2 details the natural gas well production process, data characteristics, and common diagnostic challenges.
Section 3 presents the proposed method based on diffusion enhancement and CNN-Transformer modeling.
Section 4 describes comparative experiments and performance evaluations.
Section 5 concludes the paper and discusses future directions.
2. Brief Introduction of Natural Gas Well System
2.1. Gas Well Production and Monitoring System Overview
A natural gas well is a complex operating system combining underground reservoirs, wellbore structures, surface gathering and transportation facilities, and online monitoring systems. In a typical gas-extraction process, natural gas is extracted from the formation, rises to the wellhead through the wellbore, and is processed by separation, dehydration, pressurization, and other equipment before being transported to the downstream pipeline network. In intelligent oil and gas fields, SCADA (Supervisory Control and Data Acquisition) systems have been widely deployed to collect multi-dimensional monitoring data in real time, including oil pressure, casing pressure, well temperature, and production.
Such multi-source time series data are characterized by non-stationarity, strong coupling, and high noise; they are continuously affected by environmental perturbations and operational adjustments and exhibit pronounced dynamic patterns. An intelligent diagnosis system for natural gas wells must therefore be able to handle subtle fluctuations and latent anomalies in high-dimensional time series features, and must possess strong dynamic-modeling and pattern-recognition capabilities [17].
2.2. Typical Fault Modes and Data Signatures
Drawing on historical operation data and expert experience, the common failure modes of natural gas wells fall mainly into the following categories:
Wellhead hydrate freezing and plugging: hydrates form as gas cools at the wellhead, blocking the outlet. It is characterized by rising oil pressure, a marked drop in well temperature within 2 h, a sudden drop in gas production, and falling outgoing pressure, and it typically develops rapidly over a short period;
Wellbore hydrate plugging: hydrates form in the middle or lower part of the wellbore. It is characterized by drops in production, oil pressure, and well temperature within 2–6 h, with no obvious change in the differential pressure between oil pressure and casing pressure, and must be distinguished from other plugging types [18];
Fluid accumulation in the wellbore (liquid loading): liquid accumulating in the well impedes gas flow. It is manifested by a rapid decline in production to nearly zero within 3–5 days, a drop in oil pressure, and a significant increase in the differential pressure between oil pressure and casing pressure, while casing pressure changes little; it is a gradual type of failure [19];
Normal production status: all monitoring parameters, such as oil pressure, well temperature, and production, remain within reasonable fluctuation ranges, with no significant sudden changes or abnormal trends, and the system runs smoothly.
These faults exhibit characteristic change patterns at different time scales in the multidimensional monitoring data, but their early signs involve weak signals and overlapping features and are easily masked by noise, making timely recognition difficult for traditional methods.
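To make concrete why fixed thresholds struggle here, the sketch below hard-codes the qualitative signatures above as naive rules; all numeric thresholds are purely hypothetical illustrations, not field-calibrated values.

```python
def naive_rule_diagnosis(d_oil_p, d_temp, d_prod, d_diff_p):
    """Naive threshold rules over recent changes in oil pressure, well
    temperature, production, and oil/casing differential pressure.
    Thresholds are hypothetical; early-stage faults with weak, overlapping
    signatures routinely defeat this kind of logic."""
    if d_oil_p > 0.1 and d_temp < -0.1 and d_prod < -0.1:
        return "Wellhead freeze plugging"      # rapid onset, within ~2 h
    if d_oil_p < -0.1 and d_temp < -0.1 and abs(d_diff_p) < 0.05:
        return "Wellbore freeze plugging"      # evolves over 2-6 h
    if d_prod < -0.3 and d_oil_p < -0.1 and d_diff_p > 0.2:
        return "Wellbore fluid accumulation"   # gradual, 3-5 days
    return "Normal production"
```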
2.3. Problem Statement
Given multidimensional time series monitoring data $X = (x_1, x_2, \ldots, x_T) \in \mathbb{R}^{T \times d}$, where $T$ denotes the length of the time series and $d$ is the sensor dimension, the goal is to construct a classification function
$$f_\theta : \mathbb{R}^{T \times d} \rightarrow \mathcal{Y}$$
that maps an input time series $X$ to its corresponding operational state $y \in \mathcal{Y}$, where $\mathcal{Y} = \{\text{NP}, \text{WFA}, \text{WFP}, \text{HFP}\}$ denotes normal production and the three fault modes described in Section 2.2.
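For concreteness, the following minimal sketch fixes the tensor shapes of this formulation, assuming the experimental dataset of Section 4.1.1 (windows of $T = 120$ timesteps over $d = 3$ pressure channels).

```python
import torch

CLASSES = ["NP", "WFA", "WFP", "HFP"]  # the four operational states, y in {0,...,3}
X = torch.randn(120, 3)                # one input window X in R^{T x d}, T=120, d=3
# f_theta : R^{T x d} -> {0,...,3} is the classifier to be learned from
# (augmented) training pairs (X, y).
```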
The task faces the following two major challenges:
Small-sample problem: real faults occur infrequently, fault samples are scarce, and the class distribution is severely imbalanced;
Multi-scale time series feature extraction: faults evolve at different rates, which requires the model to have both local detail awareness and global dependency-modeling capabilities.
Therefore, this paper aims to design an intelligent diagnostic framework with small-sample augmentation capabilities and complex time series-modeling capability to achieve accurate fault identification in natural gas wells.
4. Experiments and Analysis
This section is dedicated to the comprehensive experimental validation of our proposed framework. Our primary objective is to demonstrate the framework’s effectiveness, particularly in addressing the critical industrial challenge of data scarcity in fault diagnosis. To this end, we have structured our evaluation into four distinct parts. First, we will detail the experimental setup (
Section 4.1), including the real-world dataset, preprocessing pipeline, evaluation metrics, and implementation specifics. Second, we will assess the quality of the synthetic data generated by our diffusion model and showcase its powerful application in data augmentation under various data-scarce scenarios (
Section 4.2). Third, a thorough comparison with other mainstream diagnostic models (Section 4.2.3) will be conducted to benchmark our method's performance. Finally, a series of ablation studies (
Section 4.3) will be presented to dissect our model’s architecture and validate the contribution of its key components, followed by a discussion of the broader implications and limitations of our work.
4.1. Experimental Setup
This section outlines the foundational setup for all subsequent experiments, including a detailed description of the dataset, the metrics used for evaluation, and the specifics of the model implementations.
4.1.1. Dataset and Preprocessing
The dataset for this study was sourced from a real-world monitoring system deployed in a natural gas field. It consists of multivariate time series data capturing three key physical parameters: casing pressure, oil pressure, and external transmission pressure. Based on operational logs, each sample is labeled into one of four classes: Normal Production (NP) and three distinct fault conditions—Wellbore Fluid Accumulation (WFA), Wellbore Freeze Plugging (WFP), and Wellhead Freeze Plugging (HFP). A critical characteristic of this dataset is the significant class imbalance, with fault instances being substantially rarer than normal operational data. This inherent data scarcity poses a major challenge for training robust diagnostic models and serves as the primary motivation for our data augmentation strategy.
During data preprocessing, missing values were filled via linear interpolation to preserve temporal continuity, and the Hampel filter was applied to remove outliers owing to its robustness against impulsive noise. Following common practice, min-max normalization was then employed to scale values to the [0, 1] interval. Finally, a sliding-window technique was used to slice the original signals into time series segments of 120 timesteps, rendering them suitable for model input. The resulting dataset was partitioned into training, validation, and testing sets at a 70%/15%/15% ratio. The specific sample distribution, which highlights the aforementioned class imbalance, is detailed in
Table 1. The validation set was strictly reserved for hyperparameter tuning and model selection.
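For illustration, the following is a minimal sketch of this preprocessing pipeline; the Hampel window length and threshold and the window stride are our assumptions, as the text does not specify them.

```python
import numpy as np
import pandas as pd

def hampel(x: pd.Series, window: int = 11, n_sigma: float = 3.0) -> pd.Series:
    """Replace outliers with the rolling median (MAD-based Hampel filter)."""
    med = x.rolling(window, center=True, min_periods=1).median()
    mad = (x - med).abs().rolling(window, center=True, min_periods=1).median()
    outlier = (x - med).abs() > n_sigma * 1.4826 * mad
    return x.mask(outlier, med)

def preprocess(df: pd.DataFrame, win: int = 120, stride: int = 120) -> np.ndarray:
    df = df.interpolate(method="linear")                 # fill missing values
    df = df.apply(hampel)                                # suppress outliers per channel
    df = (df - df.min()) / (df.max() - df.min() + 1e-8)  # min-max scale to [0, 1]
    segments = [df.values[i:i + win]                     # sliding windows
                for i in range(0, len(df) - win + 1, stride)]
    return np.stack(segments)                            # shape (N, 120, d)
```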
4.1.2. Evaluation Metrics
To conduct a robust and quantitative evaluation of our framework, we employ a series of evaluation metrics, which are categorized into two types: one for assessing the quality of the generated data, and the other for evaluating the performance of the final fault diagnosis model.
First, to evaluate the quality of the generated data, we rigorously quantify the similarity between synthetic and real samples using the following metrics:
Earth Mover’s Distance (EMD): Also known as the first Wasserstein distance ($W_1$), EMD measures the similarity between the global distributions of the generated and real data by calculating the minimum “cost” of transforming one distribution into the other. A lower EMD value signifies a closer match between the two distributions. It is defined as:
$$W_1(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\lVert x - y \rVert\big]$$
where $P_r$ and $P_g$ represent the real and generated data distributions, respectively, and $\Pi(P_r, P_g)$ is the set of all joint distributions whose marginals are $P_r$ and $P_g$.
Kullback-Leibler (KL) Divergence: KL divergence quantifies the difference between the probability distributions of the generated ($Q$) and real ($P$) data. It effectively measures the information lost when approximating the real distribution with the generated one. A smaller KL divergence indicates a better approximation. It is defined as:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Mean Squared Error (MSE): MSE assesses the direct, point-wise similarity between generated and real samples by calculating the average squared difference between corresponding data points. It provides a straightforward measure of reconstruction accuracy; a smaller MSE indicates higher fidelity. It is defined as:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \hat{x}_i \right)^2$$
where $x_i$ and $\hat{x}_i$ are the $i$-th data points of the real and generated samples, respectively, and $n$ is the number of data points.
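For reproducibility, a compact sketch of how these three generation-quality metrics can be computed is given below; the histogram-based KL estimator and its bin count are our assumptions, since the text does not fix an estimator.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd(real: np.ndarray, gen: np.ndarray) -> float:
    """First Wasserstein distance between the pooled value distributions."""
    return wasserstein_distance(real.ravel(), gen.ravel())

def kl_divergence(real: np.ndarray, gen: np.ndarray, bins: int = 50) -> float:
    """Histogram estimate of D_KL(P_real || Q_gen) over a shared range."""
    lo, hi = min(real.min(), gen.min()), max(real.max(), gen.max())
    p, _ = np.histogram(real, bins=bins, range=(lo, hi))
    q, _ = np.histogram(gen, bins=bins, range=(lo, hi))
    p = p / p.sum() + 1e-10                  # normalize, avoid log(0)
    q = q / q.sum() + 1e-10
    return float(np.sum(p * np.log(p / q)))

def mse(real: np.ndarray, gen: np.ndarray) -> float:
    """Mean squared point-wise difference between paired samples."""
    return float(np.mean((real - gen) ** 2))
```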
Additionally, we adopt the F1-Score as the primary metric for evaluating the fault-diagnosis model. This metric is particularly suitable for our task because it is the harmonic mean of Precision and Recall, offering a balanced assessment under class imbalance. A higher F1-Score represents better and more robust classification performance. It is calculated as:
$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $\mathrm{Precision} = \frac{TP}{TP + FP}$ and $\mathrm{Recall} = \frac{TP}{TP + FN}$, with $TP$, $FP$, and $FN$ denoting true positives, false positives, and false negatives. For our multi-class task we report the macro-averaged F1-Score, i.e., the unweighted mean of the per-class F1-Scores.
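A minimal usage sketch with scikit-learn follows; the integer label encoding is illustrative.

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 3, 0, 1]   # integer labels: 0=NP, 1=WFA, 2=WFP, 3=HFP
y_pred = [0, 1, 2, 2, 0, 1]   # classifier outputs on the test set
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted per-class mean
```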
4.1.3. Implementation Details
The proposed framework was implemented in Python 3.10 using the PyTorch 2.1 library. All models were trained and evaluated on a workstation equipped with an Intel Core Ultra 7 155H CPU, 32 GB of RAM, and a single NVIDIA GeForce RTX 4060 GPU.
For the diagnostic models, training was conducted for a maximum of 50 epochs with a batch size of 512. We employed the Adam optimizer with the initial learning rate listed in Table 3 and used cross-entropy loss as the objective function. To prevent overfitting, an early-stopping mechanism with a patience of 10 epochs was implemented, monitoring the validation loss. The detailed structural parameters of both our proposed diagnostic model and the data-augmentation module are provided in
Table 2. The final hyperparameters used for training are presented in
Table 3.
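The stated protocol can be summarized in the short training-loop sketch below; `model`, `train_loader`, and `val_loader` are assumed to exist, and the learning-rate default is a placeholder for the value listed in Table 3.

```python
import copy
import torch

def train(model, train_loader, val_loader, lr=1e-3, epochs=50, patience=10):
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # lr value is a placeholder
    ce = torch.nn.CrossEntropyLoss()
    best_loss, best_state, wait = float("inf"), None, 0
    for _ in range(epochs):                            # at most 50 epochs
        model.train()
        for x, y in train_loader:                      # batches of size 512
            opt.zero_grad()
            ce(model(x), y).backward()
            opt.step()
        model.eval()
        with torch.no_grad():                          # monitor validation loss
            val_loss = sum(ce(model(x), y).item() for x, y in val_loader)
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:                       # early stopping, patience 10
                break
    model.load_state_dict(best_state)                  # restore best checkpoint
    return model
```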
4.2. Data Generation Evaluation and Application
This section evaluates the core contribution of our work: the diffusion-based data-augmentation framework. We first assess the quality and fidelity of the synthetic data generated by our DPM. Subsequently, we demonstrate its practical application and effectiveness in enhancing fault-diagnosis performance, particularly under challenging data-scarce conditions.
4.2.1. Evaluation of Generated Sample Quality
Before applying the synthetic data for downstream diagnostic tasks, it is imperative to first validate its quality and fidelity. A high-quality generative model must produce samples that are not only statistically congruent with the real data but also capture the characteristic temporal dynamics inherent to each operational condition. To this end, we conducted a comprehensive evaluation involving both qualitative visual inspection and rigorous quantitative analysis.
For a qualitative assessment,
Figure 4 provides a direct visual comparison between authentic time series samples randomly selected from our dataset and synthetic samples generated by our trained DPM for each of the four operational classes. As can be observed, the generated signals (bottom row) adeptly replicate the distinct morphological patterns, amplitudes, and underlying structures of their real counterparts (top row). The synthetic data closely mirrors the subtle fluctuations and characteristic trends present in the authentic signals, rendering them virtually indistinguishable by visual inspection alone. This provides strong initial evidence of our model’s ability to capture the complex data-generating process.
To complement this qualitative assessment with a more rigorous, objective evaluation, we benchmarked our DPM against three representative deep generative models from the recent literature, selected to cover advanced GAN and VAE variants tailored for fault-diagnosis tasks: the 2D-CNN conditional GAN (2D-CNNGAN) [28], which combines a CGAN with a 2D CNN on image-like data representations; the multilabel one-dimensional GAN (ML1D-GAN) [29], a fully 1D convolutional GAN noted for strong generalization on time series signals; and the CVAE with centroid loss (CL-VAE) [30], a modified VAE that uses a penalization mechanism to improve sample fidelity. All models were trained on the same dataset to generate fault samples, and the quality of their outputs was quantitatively evaluated against real samples using the metrics defined in Section 4.1.2.
The comprehensive results of this comparison are summarized in
Table 4. The quantitative analysis confirms the superiority of our proposed DPM, which consistently achieved the best (lowest) scores across all evaluation metrics. Its low EMD and KL divergence scores indicate that the data it generates match the real data's global and probabilistic distributions most faithfully. Its leading performance in MSE and in Dynamic Time Warping (DTW) distance, a shape-similarity measure reported in Table 4 alongside the metrics of Section 4.1.2, further demonstrates its ability to reconstruct both the point-wise values and the complex morphological shapes of the time series signals. This evaluation confirms that our DPM is a reliable, high-quality data source, laying a solid foundation for its subsequent application in data augmentation.
4.2.2. Effectiveness of Data Augmentation
Having established the high fidelity of the synthetic data generated by our DPM, we now evaluate its primary application: addressing the class imbalance inherent in our real-world dataset. As detailed in
Table 1, the training set is characterized by a significant disparity in sample counts, with the Normal Production (NP) class vastly outnumbering the three fault classes (WFA, WFP, and HFP). This imbalance can bias a diagnostic model towards the majority class, leading to poor performance on the critical, underrepresented fault conditions.
To assess the effectiveness of our data-augmentation strategy in mitigating this bias, we designed the following experiment. We first established a baseline by training the CNN-Transformer model solely on the original, imbalanced training set. We then progressively augmented the training set by adding varying quantities of synthetic samples (100, 200, 500, and 1000) for each of the three minority fault classes (WFA, WFP, and HFP), thereby rebalancing the class distribution. The diagnostic model was evaluated on the real-data test set at each level of augmentation, allowing us to quantify the benefits of the synthetic data directly.
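Schematically, the augmentation protocol looks as follows; `dpm.sample`, `build_cnn_transformer`, `make_loader`, `evaluate`, and `CLASS_TO_ID` are hypothetical stand-ins for this paper's components, not an actual published API, and `train` refers to the loop sketched in Section 4.1.3.

```python
import numpy as np

MINORITY = ["WFA", "WFP", "HFP"]
for n in [0, 100, 200, 500, 1000]:             # augmentation levels tested
    xs, ys = [X_train], [y_train]              # start from the imbalanced set
    for cls in MINORITY:
        xs.append(dpm.sample(condition=cls, num_samples=n))  # hypothetical API
        ys.append(np.full(n, CLASS_TO_ID[cls]))
    model = train(build_cnn_transformer(),
                  make_loader(np.concatenate(xs), np.concatenate(ys)),
                  val_loader)
    print(n, evaluate(model, test_loader))     # macro F1 on the real test set
```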
The results of this experiment are visualized in
Figure 5, which plots the macro
F1-Score on the test set as a function of the number of synthetic samples added. The findings unequivocally demonstrate the profound and positive impact of our data-augmentation strategy. The baseline model, trained without any augmentation (0 added samples), achieved a macro
F1-Score of 84.56%. While respectable, this performance is susceptible to biases from the imbalanced data. As synthetic fault samples were introduced, a clear and substantial trend of performance improvement was observed. The curve shows a steep ascent, with the
F1-Score surging to 99.52% after the training set was augmented with 1000 synthetic samples per minority class. This marked improvement of nearly 15 percentage points underscores the ability of the high-quality synthetic data to compensate effectively for the underrepresentation of fault classes. The rising curve validates our framework's capacity to mitigate classification bias and yield a more reliable and equitable fault-diagnosis system, highlighting its potential for practical industrial applications where collecting balanced fault data is often infeasible.
4.2.3. Comparison with Other Diagnostic Models
To further validate the superiority of our proposed CNN-Transformer architecture, we conducted a comparative analysis against several widely used deep learning models for time series classification. This experiment isolates the performance of the diagnostic model itself, demonstrating its capability to capture complex fault patterns from the augmented data.
For this comparison, we replaced our CNN-Transformer model with three alternative architectures: a standard 1D-CNN, a Long Short-Term Memory (LSTM) network, and a Gated Recurrent Unit (GRU) network. To ensure a fair and rigorous benchmark, all models were trained on the exact same augmented dataset, which consists of the original imbalanced training data supplemented with 1000 synthetic samples per minority class, as this configuration yielded the best performance in our previous experiment. All models were trained using the same protocol, with their hyperparameters individually optimized on the validation set.
The performance of each diagnostic model on the real-data test set is summarized in
Table 5. Our proposed CNN-Transformer model achieved the highest macro
F1-Score of 99.52%, outperforming all other architectures. The 1D-CNN, while effective, achieved a lower score of 95.83%, likely due to its limitations in modeling long-range dependencies. The recurrent models, LSTM and GRU, scored 93.54% and 94.98%, respectively, indicating proficiency in handling sequential data but weaker hierarchical feature extraction compared with our hybrid approach.
A more granular analysis of the classification behavior is provided in
Figure 6 and
Figure 7. The confusion matrices in
Figure 6 show that while all models perform well, the baseline models (b–d) exhibit noticeable confusion between the WFP and HFP fault types. In contrast, our proposed CNN-Transformer (a) displays a nearly diagonal matrix, indicating significantly higher classification accuracy and minimal errors across all classes.
This performance difference is further explained by the t-SNE feature visualizations in
Figure 7. The feature space of our CNN-Transformer (a) shows four distinct and well-separated clusters, demonstrating a highly discriminative feature representation. The baseline models (b–d), however, produce feature spaces with more overlap and less defined clusters, particularly for the fault classes. This provides compelling evidence that the superior feature learning capability of our hybrid architecture is the key driver behind its state-of-the-art diagnostic performance.
4.3. Ablation Studies and Discussion
To further dissect our proposed framework and validate the contribution of its key components, we conducted a series of ablation studies. This section presents these findings and concludes with a broader discussion of our method’s implications, limitations, and potential future directions.
4.3.1. Ablation Study
To validate the individual contributions of the key components within our proposed framework, we conducted a comprehensive ablation study. The study was designed to answer two critical questions: (1) To what extent does the DPM-based data augmentation contribute to the final diagnostic performance? (2) What is the synergistic effect of combining CNN and Transformer modules in the diagnostic model? To this end, we evaluated four distinct model configurations:
Full Framework: Our complete, proposed method (DPM Augmentation + CNN-Transformer).
No Augmentation: The CNN-Transformer model trained solely on the original, imbalanced dataset, removing the effect of the DPM.
Transformer-Only: A diagnostic model using only the Transformer component, to isolate the contribution of the CNN module.
CNN-Only: A diagnostic model using only the CNN component, to isolate the contribution of the Transformer module.
To ensure a fair comparison, all variants were trained under identical hyperparameter settings and, where applicable, on the same augmented dataset. The results, measured by the macro F1-Score on the test set, are presented in
Table 6.
The results of the ablation study provide clear insights. The largest performance degradation, a drop of 14.96 percentage points in macro F1-Score, occurred when data augmentation was removed. This starkly demonstrates that the DPM-based augmentation is the most critical component for overcoming the class imbalance problem and is fundamental to achieving high performance.
Furthermore, the necessity of the hybrid architecture is also evident. Removing the CNN or the Transformer module reduced performance by 3.69 and 2.38 percentage points, respectively. The CNN-Only model, while adept at extracting local features, struggles to capture long-range temporal dependencies; conversely, the Transformer-Only model may overlook the fine-grained local patterns the CNN is designed to capture. The superior performance of the full framework confirms that the two components work synergistically: the CNN provides a powerful local feature representation that the Transformer then contextualizes globally, yielding a more robust and discriminative feature space for diagnosis.
4.3.2. Discussion
The comprehensive experimental results presented in this study robustly validate the effectiveness of our proposed framework, which synergistically integrates a high-fidelity diffusion model for data augmentation with a powerful CNN-Transformer architecture for fault diagnosis. Our findings highlight two key contributions. First, the DPM proved superior to other generative models in creating realistic, high-fidelity time series data, and its application as an augmentation tool was shown to be highly effective in mitigating the diagnostic bias caused by class imbalance. Second, the hybrid CNN-Transformer model consistently outperformed other standard architectures by effectively learning both local feature details and global temporal patterns, a capability confirmed by our ablation studies.
5. Conclusions
This paper proposes a multi-fault-diagnosis method for natural gas wells that integrates a diffusion model with a CNN-Transformer architecture, addressing key challenges in field data such as complex multivariate coupling, scarce fault samples, and dynamic operating conditions. Through a combination of data augmentation and time series modeling, the following critical findings are achieved:
In terms of data augmentation, the 1D diffusion generative network based on diffusion probabilistic modeling successfully generates high-fidelity and diverse time series data for small-sample fault states. Qualitative comparisons show that synthetic samples closely replicate the morphological patterns, amplitudes, and underlying structures of real samples, making them nearly indistinguishable by visual inspection. Quantitative metrics further confirm the superiority of the diffusion model: it achieves a lower Earth Mover's Distance (EMD = 0.087), Kullback-Leibler divergence (KL = 0.245), and Mean Squared Error (MSE = 0.298) than baseline models such as 2D-CNNGAN, ML1D-GAN, and CL-VAE, validating its effectiveness in capturing subtle time series patterns.
For fault diagnosis, the hybrid CNN-Transformer model enables in-depth analysis and accurate classification of key operational parameters by fusing local feature extraction and global temporal modeling capabilities. Experimental results demonstrate that on the augmented dataset with 1000 synthetic samples added per minority class, the model achieves a macro F1-Score of 99.52%, significantly outperforming 1D-CNN (95.83%), LSTM (93.54%), and GRU (94.98%). Confusion matrices and t-SNE feature visualizations further reveal that the proposed model achieves near-perfect classification accuracy across all fault types, with four distinct and well-separated clusters in the feature space, whereas baseline models exhibit noticeable class confusion.
Ablation studies verify the necessity of each component: removing the diffusion-based augmentation module reduces performance by 14.96 percentage points, highlighting its critical role in mitigating class imbalance. Removing the CNN or the Transformer module (i.e., using the Transformer or the CNN alone) leads to declines of 3.69 and 2.38 percentage points, respectively, confirming the synergistic effect of the two components in capturing local details and global dependencies.
In summary, the proposed data-augmentation mechanism and temporal classification framework outperform traditional methods in terms of structural consistency of generated samples, augmentation effectiveness, and multi-class fault-diagnosis accuracy. The framework is highly transferable and can be extended to other industrial monitoring scenarios with nonlinear and dynamic characteristics. Future work will focus on developing an online learning mechanism to adapt to concept drift from evolving well conditions and enhancing model interpretability to provide operators with actionable insights into fault causality, ultimately advancing a more practical and deployable intelligent fault-diagnosis system.