Synthetic Leak Data Generation Using Variational Autoencoders to Address Data Imbalance in Acoustic Emission-Based Pipe Leak Detection

Park, Byungjae; Ryu, Hyejeong; Yoo, Hyeongmin

doi:10.3390/app16063050

by Byungjae Park ¹, Hyejeong Ryu ²,* and Hyeongmin Yoo ³,*

¹ School of Mechanical Engineering, Korea University of Technology and Education, Cheonan-si 31253, Republic of Korea

² Department of Mechatronics Engineering, Kangwon National University, Chuncheon-si 24341, Republic of Korea

³ School of Mechanical Engineering, Kookmin University, Seoul 02707, Republic of Korea

* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(6), 3050; https://doi.org/10.3390/app16063050

Submission received: 11 February 2026 / Revised: 5 March 2026 / Accepted: 16 March 2026 / Published: 21 March 2026

(This article belongs to the Special Issue Recent Developments in Acoustic Emission and Non-Destructive Evaluation)

Featured Application

The proposed method can be applied beyond pipe monitoring to structural monitoring and the detection of abnormal states that may occur during the manufacturing process.

Abstract

A synthetic data generation method is proposed to mitigate data imbalance in pipeline leak detection using acoustic emission (AE) sensors. Collecting sufficient AE signals in the leak state is challenging due to the rarity of leaks and safety concerns, which leads to highly imbalanced datasets. As a result, the performance of leak detection methods may be degraded because the models tend to be biased towards the normal state. The proposed method utilizes a variational autoencoder (VAE) to probabilistically model the difference between the normal-state and leak-state spectrograms. After the VAE is trained with the spectrogram differences, its decoder generates spectrogram differences from random latent vectors. Synthetic leak-state spectrograms are created by adding the generated spectrogram differences to normal-state spectrograms. The effectiveness of the proposed method is evaluated by comparing the leak detection performance of models trained with and without the proposed method. A leak detection model trained with synthetic leak data generated by the proposed method shows improved detection performance compared to models trained using existing oversampling methods.

Keywords: leak detection; acoustic emission; oversampling; synthetic data generation; variational autoencoder

1. Introduction

Pipes are widely used in manufacturing facilities and chemical plants to efficiently transport gases or liquids. However, if a leak occurs, it can lead to major accidents causing casualties or environmental pollution. Therefore, real-time monitoring of vibrations generated by the pipes is necessary for early leak detection [1,2].

Various sensors are used to monitor the condition of pipes, such as flow meters, pressure sensors, fiber-optic sensors, and acoustic emission (AE) sensors [3,4,5,6]. Among these sensors, AE sensors are suitable for collecting vibration signals from a pipe because they can detect small high-frequency vibrations despite their compact size and low power consumption [7,8]. AE sensors made of flexible materials can be attached to curved pipe surfaces to collect AE signals effectively [9]. However, because AE sensors are highly sensitive, the signals they collect also contain ambient noise from various sources in manufacturing facilities or chemical plants [10].

Various machine learning-based leak detection methods have been proposed [11,12]. Training the models used by these methods requires a dataset containing both normal-state and leak-state signals, but collecting such a dataset from real manufacturing facilities or chemical plants is challenging. Collecting signals from AE sensors during normal operation is relatively easy because the pipes are actively in use. In contrast, collecting AE signals during leakage events is considerably more difficult: leaks rarely occur in practice, and intentionally inducing leaks for data collection is restricted due to safety concerns. As a result, depending on the data collection environment and conditions, it may be possible to collect only one or a few leak-state signals. This often leads to a significant imbalance between normal-state and leak-state AE signals; in practice, the ratio of normal-state to leak-state signals can easily reach 100:1 or higher. If the imbalance is not addressed, the leak detection performance of the trained models may be degraded because the models become biased towards the normal state.

To address the imbalance issue, this paper proposes a synthetic leak data generation method. Unlike existing oversampling methods that rely on simple duplication or interpolation of the limited leak-state data [13,14], the proposed method generates synthetic leak-state data by probabilistically modeling the difference between normal and leak states. First, both normal-state and leak-state AE signals are converted into spectrograms [15], and the differences between these spectrograms are calculated. Second, these spectrogram differences are used to train a variational autoencoder (VAE) [16], whose encoder–decoder architecture learns the probabilistic distribution of the spectrogram differences. Finally, the decoder of the trained VAE generates spectrogram differences from random latent vectors, and the generated differences are added to the normal-state spectrograms to synthesize leak-state spectrograms.

The proposed method assumes a scenario where normal-state data is abundant and leak-state data is very limited. In this scenario, the proposed method mitigates data imbalance by generating a large amount of synthetic leak-state data from a small amount of real leak-state data along with abundant normal-state data. The proposed method is an oversampling strategy used to reduce the class bias of leak detection models. It does not alter the prior probability of leak events in real-world environments. If applied appropriately, it can reduce the false detection rate by improving the decision boundary learned by the detection model. Experimental results support the effectiveness of the proposed method in improving leak detection performance by reducing the false detection rate. The applications of the proposed method can be extended beyond pipe leak detection to include monitoring other manufacturing equipment.

This paper is organized as follows. Section 2 reviews related work on pipe leak detection using AE sensors and on data imbalance handling methods. Section 3 describes the data acquisition process using AE sensors in a testbed that simulates pipe leaks. Section 4 explains the proposed method for synthetic leak data generation using a VAE. Section 5 presents experimental results that validate the effectiveness of the proposed method. Section 6 presents conclusions and future work.

2. Related Work

2.1. Pipe Leak Detection Using Acoustic Emission Sensors

Various methods have been proposed to detect pipe leaks using AE sensors. The proposed methods commonly employ an approach that extracts features from AE signals containing noise, and then uses a classifier to detect leaks using the extracted features. A leak detection method using support vector machines (SVMs) with rough set theory and artificial bee colony (ABC) has been proposed [17]. Wavelet double thresholding has been proposed to remove white noise from AE signals before feature extraction [18]. Capturing the unique time-frequency signature generated by leaks in AE signals using short-time Fourier transform (STFT) and tuned wavelet transform has been proposed to detect pipe leaks [19]. A method has been proposed to detect leaks by applying K-means clustering to the intrinsic mode function (IMF) containing leak signals after decomposing AE signals using empirical mode decomposition (EMD). The reliability of the detection results was verified by the consistent distribution of statistical features within the clusters identified as leaks [20]. A pipe leak detection method has been proposed to accurately estimate the time delay of AE signals by integrating the energy and frequency characteristics of AE signals based on the wavelet transform [21]. End-to-end leak detection using a convolutional neural network (CNN) has been proposed [22]. Widely used CNN architectures such as ResNet18 have been employed to detect pipe leaks using spectrograms of AE signals [23]. A CNN–LSTM hybrid model using scalogram images generated by continuous wavelet transform (CWT) has been proposed for pipe leak detection [24].

The methods described above generally require sufficient acquisition of both normal-state data and leak-state data during model training to accurately detect pipe leaks.

2.2. Data Imbalance Handling

Although its severity depends on the data collection environment and conditions, data imbalance is difficult to avoid when training classification models. The ideal strategy is collecting a sufficient amount of data for each class, but this is often impractical. Consider the leak detection problem in pipes addressed in this paper. Data collection under normal operating conditions is relatively straightforward, whereas collecting data under leak conditions is considerably more challenging. As a result, acquiring a balanced number of data points for both normal and leak states is difficult.

Various methods have been proposed for handling data imbalance when training classification models. One common approach assigns different weights to the minority and majority classes when calculating the loss, which reduces the influence of the majority class [25,26].
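As a rough illustration of loss weighting, the sketch below computes inverse-frequency class weights and a weighted binary cross-entropy. The weighting formula (number of samples divided by the number of classes times the class count) is one common heuristic and is an assumption here, not necessarily the scheme used in [25,26]; the 100:1 toy dataset and the function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def weighted_bce(y_true, p_pred, class_weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    w = np.where(y_true == 1, class_weights[1], class_weights[0])
    losses = -(y_true * np.log(p_pred) + (1.0 - y_true) * np.log(1.0 - p_pred))
    return float(np.sum(w * losses) / np.sum(w))

# Hypothetical 100:1 dataset: 1000 normal (0) and 10 leak (1) samples.
labels = np.array([0] * 1000 + [1] * 10)
weights = inverse_frequency_weights(labels)
```

With these weights, each leak sample counts 100 times as much as a normal sample, so both classes contribute equally to the total loss.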

Undersampling and oversampling methods have been proposed to resolve the imbalance issue at the data level. Random undersampling randomly removes samples from the majority class until the number of samples in the majority class matches that of the minority class [27]. Tomek Links deletes the majority-class data point when two data points are each other's nearest neighbors but belong to different classes [28]. Unlike undersampling methods, oversampling methods generate new data points for the minority class until the number of samples in the minority class matches that of the majority class. Random oversampling simply duplicates existing data points in the minority class [14]. Synthetic minority oversampling technique (SMOTE), one of the most widely used methods, generates new data points by interpolating between existing minority-class data points [29]. Adaptive synthetic sampling (ADASYN) extends SMOTE to generate more data points from hard-to-classify minority data [30].
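The difference between duplication and interpolation can be sketched in a few lines. The code below is a simplified illustration, not the reference SMOTE implementation from [29]: the function names, the neighbor count `k`, and the toy data are assumptions.

```python
import numpy as np

def random_oversample(minority, n_new, rng):
    """Duplicate randomly chosen minority samples (random oversampling)."""
    idx = rng.integers(0, len(minority), size=n_new)
    return minority[idx]

def smote_like(minority, n_new, k, rng):
    """Interpolate between a minority sample and one of its k nearest minority neighbors."""
    n = len(minority)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    pick = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))             # interpolation factor in [0, 1)
    return minority[base] + lam * (minority[pick] - minority[base])

rng = np.random.default_rng(0)
minority = rng.normal(size=(5, 3))           # toy minority class, 5 samples
duplicates = random_oversample(minority, 20, rng)
synthetic = smote_like(minority, 20, k=2, rng=rng)
```

Every interpolated sample is a convex combination of two existing samples, so it never leaves the region already spanned by the minority class.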

The methods mentioned above may not be effective when the minority class contains too few samples or when the data has high dimensionality [29].

3. Data Acquisition

Collecting leak data from gas pipelines in operational settings is difficult and dangerous. Therefore, a testbed was constructed using a 35 mm galvanized pipe, similar to those used in actual manufacturing facilities, to simulate leaks (Figure 1). Nitrogen gas is injected at one end of the pipe, and loosening the flange at the opposite end simulates the kind of gas leak that commonly occurs at flange connections between pipes in chemical plants. The pressure of the injected nitrogen gas controls the flow rate of the leak. Normal-state data is collected when the flange is securely tightened (zero leak flow), while leak-state data is collected when the flange is loosened. A flexible AE sensor [9], a prototype manufactured by Rina Solution [31], was attached 150 mm from the leak location. The sensor is fabricated with a three-layer thin-film structure. The outer layer consists of electrodes for signal transmission and polyimide. The middle layer is made of piezoelectric PZT fibers embedded in epoxy, which convert mechanical vibrations into electrical signals. The sensor is 0.2 mm thick, and its first resonant frequency is approximately 20 kHz [9]. Signals collected by the AE sensor are amplified with a gain of 40 dB and then sampled at 200 kHz. Figure 2 shows the normal- and leak-state AE signals collected from the testbed.

The collected AE signal is converted into a spectrogram [15]. A spectrogram represents the frequency variation of a signal over time in a two-dimensional time-frequency domain by applying STFT [32] to a one-dimensional time-domain signal.

A collected AE signal is represented as $s[n]$:

$$S = \{s[0], s[1], \dots, s[N-1]\}. \tag{1}$$

The AE signal is divided into overlapping frames. The $m$-th frame signal $s_m[l]$ is represented as

$$s_m[l] = s[l + mH], \quad l = 0, 1, \dots, L-1, \tag{2}$$

where $H$ and $L$ are the hop size and frame length, respectively.

The windowed $m$-th frame signal $x_m[l]$ is obtained by multiplying the frame signal $s_m[l]$ with the window function:

$$x_m[l] = s_m[l] \cdot w[l], \tag{3}$$

where $w[l]$ is the window function used to reduce spectral leakage [33] caused by the discontinuities at the edges of each frame:

$$w[l] = a_0 - a_1 \cos\left(\frac{2\pi l}{L-1}\right), \tag{4}$$

where $a_0$ and $a_1$ are constants of the window function. The windowed signal $x_m[l]$ is transformed into the frequency representation $X(m,k)$ by applying the discrete Fourier transform (DFT) [34]:

$$X(m,k) = \sum_{l=0}^{L-1} x_m[l] \exp\left(-j\frac{2\pi k l}{L}\right), \quad k = 0, 1, \dots, L-1. \tag{5}$$

The frequency representation $X(m,k)$ is converted into the spectrogram $P(m,k)$ by calculating the squared magnitude:

$$P(m,k) = |X(m,k)|^2. \tag{6}$$
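The framing, windowing, DFT, and squared-magnitude steps of Eqs. (1)-(6) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the frame length of 512, the hop size of 256, the 18 kHz test tone, and the Hann constants (a0 = a1 = 0.5) are placeholders chosen for illustration, since the actual STFT parameters are listed in Table 4 and are not reproduced here.

```python
import numpy as np

def power_spectrogram(s, frame_len, hop, a0=0.5, a1=0.5):
    """Power spectrogram P(m, k) = |X(m, k)|^2 following Eqs. (1)-(6)."""
    s = np.asarray(s, dtype=float)
    n_frames = 1 + (len(s) - frame_len) // hop      # frames that fit entirely
    l = np.arange(frame_len)
    # Eq. (2): s_m[l] = s[l + m H]
    idx = l[None, :] + hop * np.arange(n_frames)[:, None]
    frames = s[idx]
    # Eq. (4): generalized cosine window (a0 = a1 = 0.5 gives the Hann window)
    w = a0 - a1 * np.cos(2.0 * np.pi * l / (frame_len - 1))
    # Eq. (3) and Eq. (5): window each frame, then take the DFT along l
    X = np.fft.fft(frames * w, axis=1)
    # Eq. (6): squared magnitude
    return np.abs(X) ** 2

fs = 144_000                                 # sampling rate after downsampling (Section 5)
t = np.arange(fs // 10) / fs                 # 0.1 s test signal
tone = np.sin(2.0 * np.pi * 18_000 * t)      # hypothetical 18 kHz tone
P = power_spectrogram(tone, frame_len=512, hop=256)
peak_bin = int(np.argmax(P[0, :256]))        # strongest bin below the Nyquist frequency
```

Bin k corresponds to the frequency k · fs / L, so the 18 kHz tone is expected to peak at bin 18000 · 512 / 144000 = 64. Setting a0 = 0.54 and a1 = 0.46 would give the Hamming window instead.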

Figure 3 shows spectrograms of the normal and leak states of the AE signals. Compared to the spectrograms in the normal state, those in the leak state exhibit increased signal responses across multiple time frames and frequency bins.

The normal-state spectrogram (Figure 3a) shows a small number of dominant frequency components that maintain constant positions and intensities over time, with low background noise. In contrast, the leak-state spectrogram (Figure 3b) shows that the energy of the dominant frequency components is dispersed and that noise increases across a broad frequency range.

4. Synthetic Leak Data Generation

4.1. Difference of Spectrograms

The proposed method mitigates the data imbalance issue by synthesizing the spectrograms of the leak state using multiple normal-state spectrograms and a limited number of leak-state spectrograms. For this purpose, the differences between the leak-state and the normal-state spectrograms are initially calculated (Figure 4).

$$D_{i,j} = P_i - P_j, \quad P_i \in P_{\mathrm{leak}}, \; P_j \in P_{\mathrm{normal}}, \tag{7}$$

where $D_{i,j}$ is the difference-of-spectrogram (DoS) image calculated by subtracting the normal-state spectrogram $P_j$ from the leak-state spectrogram $P_i$.

The number of leak-state spectrograms is much lower than that of normal-state spectrograms because the acquisition of leak-state spectrograms is more difficult than that of normal-state spectrograms:

$$|P_{\mathrm{leak}}| \ll |P_{\mathrm{normal}}|. \tag{8}$$

Although the number of leak-state spectrograms is far lower than the number of normal-state spectrograms, a large number of DoS images can still be computed and then used to train the VAE model. This is because the total number of possible leak–normal spectrogram pairs is $|P_{\mathrm{leak}}| \times |P_{\mathrm{normal}}|$. Figure 5 shows examples of DoS images obtained by subtracting normal-state spectrograms from leak-state spectrograms.
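The pairing in Eq. (7) can be written as a single broadcast subtraction. In the sketch below, the random arrays are small stand-ins for real spectrograms (their shape and values are hypothetical), while the counts of 2 leak-state and 1150 normal-state spectrograms follow Section 5.1.

```python
import numpy as np

def difference_of_spectrograms(leak_specs, normal_specs):
    """Return every D_ij = P_i (leak) - P_j (normal) as one stacked array."""
    # Broadcasting: (n_leak, 1, F, T) - (1, n_normal, F, T)
    diffs = leak_specs[:, None, :, :] - normal_specs[None, :, :, :]
    return diffs.reshape(-1, *leak_specs.shape[1:])

rng = np.random.default_rng(0)
# Small random stand-ins for real spectrograms (shape and values are hypothetical).
normal_specs = rng.random((1150, 16, 8))
leak_specs = rng.random((2, 16, 8)) + 0.5
dos = difference_of_spectrograms(leak_specs, normal_specs)
```

Two leak-state spectrograms paired with 1150 normal-state spectrograms give 2 × 1150 = 2300 DoS images, which is why a very small leak set can still yield a usable VAE training set.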

4.2. Training of Variational Autoencoder Using Differences-of-Spectrogram Images

If a DoS image can be generated, it can be combined with a normal-state spectrogram to synthesize a leak-state spectrogram. A VAE is used to synthesize DoS images by modeling their probabilistic distribution (Figure 6). A VAE is a generative model that encodes input data into a latent space and reconstructs the data from samples drawn from that space [16].

A VAE model has three components: an encoder, a latent space module, and a decoder (Figure 6). The distribution of the training data $D$ is mapped to a probabilistic distribution in the latent space by the encoder:

$$q_\phi(z \mid d) : D \to Z, \tag{9}$$

where $q_\phi(z \mid d)$ is the encoder network parameterized by $\phi$, $d$ is the input DoS image, and $z$ is the latent variable. $\mu$ and $\sigma$ are the mean vector and standard deviation vector of the distribution in the latent space, respectively.

By using the reparameterization trick, the latent space module samples the latent variable $\tilde{z}$ from the distribution:

$$\tilde{z} = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \tag{10}$$

where $\epsilon$ is a random vector sampled from $\mathcal{N}(0, I)$. The decoder reconstructs the input data from the latent variable $\tilde{z}$:

$$p_\theta(d \mid \tilde{z}) : \tilde{Z} \to D, \tag{11}$$

where $p_\theta(d \mid \tilde{z})$ is the decoder network parameterized by $\theta$.
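The sampling step of Eq. (10) can be illustrated with a short NumPy sketch. It assumes the latent space module outputs a mean and a log-variance vector, as described for Table 2; the batch size, latent dimension, and numeric values below are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), as in Eq. (10)."""
    sigma = np.exp(0.5 * log_var)            # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.full((10_000, 4), 2.0)               # hypothetical encoder outputs
log_var = np.full((10_000, 4), np.log(0.25)) # variance 0.25, i.e. sigma = 0.5
z = reparameterize(mu, log_var, rng)
```

Because the randomness enters only through `eps`, the sample is a deterministic function of `mu` and `log_var`, which is what allows gradients to flow back to the encoder in an automatic-differentiation framework.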

To compute the parameters of the VAE model, the following loss function is used [35]:

$$\mathcal{L} = -\mathbb{E}_{q_\phi(z \mid d)}\left[\log p_\theta(d \mid z)\right] + \beta D_{KL}\left(q_\phi(z \mid d) \,\|\, p(z)\right), \tag{12}$$

where $p(z)$ is the prior distribution of the latent variable, and $D_{KL}$ is the Kullback–Leibler divergence (KLD) [36]. $\beta$ is a weighting factor that balances the two terms in the loss function. The first term is the reconstruction loss. The second term is the KLD between the latent distribution produced by the encoder and the prior distribution $p(z) = \mathcal{N}(0, I)$.
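For a diagonal Gaussian encoder and a standard normal prior, the KLD term in Eq. (12) has a well-known closed form, which the sketch below uses. The reconstruction term is an assumption: the paper does not state its exact form here, so a summed squared error (equivalent to a Gaussian likelihood up to constants) stands in for it. The default β = 0.001 is the value reported in Section 5.1.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)

def beta_vae_loss(d, d_hat, mu, log_var, beta=0.001):
    """Reconstruction term plus beta-weighted KL term, averaged over the batch."""
    # Assumption: squared error stands in for -log p(d | z) (Gaussian likelihood).
    recon = np.sum((d - d_hat) ** 2, axis=(1, 2))
    kld = kl_to_standard_normal(mu, log_var)
    return float(np.mean(recon + beta * kld))

mu = np.zeros((3, 8))                        # hypothetical batch of 3, latent size 8
log_var = np.zeros((3, 8))
d = np.ones((3, 4, 4))                       # toy "images"
loss_perfect = beta_vae_loss(d, d, mu, log_var)
loss_shifted = beta_vae_loss(d, d, mu + 1.0, log_var)
```

With a perfect reconstruction and a posterior equal to the prior, both terms vanish. Shifting every latent mean by 1 adds a KLD of 0.5 per latent dimension, which β then scales down.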

Table 1, Table 2 and Table 3 show the detailed architectures of the encoder, latent space module, and decoder, respectively. DoS images are normalized to the range $[0, 1]$ before training the VAE model, and then resized to 256 × 128 pixels. The encoder has five convolutional layers and one linear layer (Table 1). The latent space module has two linear layers to produce the mean and log-variance vectors and a reparameterization layer to sample the latent variable (Table 2). The decoder consists of four pixel shuffle layers and one interpolation layer (Table 3). The pixel shuffle layers [37] are used to upsample the feature maps to reduce checkerboard artifacts that may occur during the upsampling process. The interpolation layer is used to refine the upsampled feature maps and produce the final output DoS images.

Every layer in the VAE model, except for the interpolation layer in the decoder, uses Leaky ReLU as the activation function [38]. To normalize the output values of the interpolation layer between 0 and 1, the sigmoid function is used as its activation function.
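To make the upsampling step concrete, the following NumPy sketch reproduces the channel-to-space rearrangement performed by a pixel shuffle layer, together with the two activation functions mentioned above. It illustrates the operation only and is not the authors' decoder; the upscale factor of 2, the Leaky ReLU slope of 0.01, and the toy feature map are assumptions.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) feature maps into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)             # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)           # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    """Squash values into (0, 1), matching the normalized DoS range."""
    return 1.0 / (1.0 + np.exp(-x))

features = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)
upsampled = pixel_shuffle(features, r=2)
```

Each group of r² channels fills one r × r block of output pixels, so the spatial resolution grows without the overlapping kernels of a transposed convolution.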

After training, the decoder of the VAE model is used to generate DoS images from randomly sampled latent variables [16]. The generated DoS images are denormalized to the original scale and then added to randomly sampled normal-state spectrograms to synthesize leak-state spectrograms (Figure 7).
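The generation procedure can be sketched as follows. Several parts are assumptions made for illustration: a global min-max scheme is used for normalization (the paper does not specify how the range [0, 1] is obtained), `toy_decoder` is a placeholder for the trained VAE decoder, resizing is omitted, and all array shapes are hypothetical.

```python
import numpy as np

def minmax_normalize(dos):
    """Scale DoS images to [0, 1]; also return the statistics needed to invert."""
    lo, hi = float(dos.min()), float(dos.max())
    return (dos - lo) / (hi - lo), (lo, hi)

def denormalize(dos_norm, stats):
    """Map normalized DoS images back to the original scale."""
    lo, hi = stats
    return dos_norm * (hi - lo) + lo

def synthesize_leak(decoder, normal_specs, stats, latent_dim, n, rng):
    """Decode random latents, undo normalization, add to random normal spectrograms."""
    z = rng.standard_normal((n, latent_dim))
    dos = denormalize(decoder(z), stats)
    picks = rng.integers(0, len(normal_specs), size=n)
    return normal_specs[picks] + dos

rng = np.random.default_rng(0)
normal_specs = rng.random((50, 16, 8))               # stand-in normal-state spectrograms
real_dos = rng.normal(loc=1.0, size=(20, 16, 8))     # stand-in training DoS images
_, stats = minmax_normalize(real_dos)

def toy_decoder(z):
    # Placeholder for the trained VAE decoder: maps latents to images in [0, 1].
    return np.full((len(z), 16, 8), 0.5)

synthetic = synthesize_leak(toy_decoder, normal_specs, stats, latent_dim=4, n=10, rng=rng)
```

Storing the normalization statistics at training time matters because DoS images can contain negative values, and the sigmoid output of the decoder has to be mapped back to that signed range before the addition.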

4.3. Training of Leak Detection Models

Synthetic leak-state spectrograms can be used to resolve the imbalance of data between normal and leak states. When training a supervised leak detection model, the synthetic leak-state spectrograms are used together with the few available real leak-state spectrograms (Figure 8).

5. Experimental Results

The effectiveness of the proposed method was evaluated under a severe data imbalance scenario where normal data were abundant and leak data were extremely limited. This assumption reflects the practical constraint that obtaining sufficient leak data from gas pipelines in real manufacturing facilities or chemical plants is challenging. To reflect the constraint and verify that the proposed method can effectively work under conditions where only an extremely limited number of leak-state data points are available, a highly challenging experimental setting was intentionally designed by restricting the number of leak-state spectrograms to just two.

The prototype flexible AE sensor [31] (Rina Solution, Cheonan-si, Chungcheongnam-do, Republic of Korea) was used to collect AE signals from the testbed. Each AE signal was collected at a sampling rate of 200 kHz for 1 s in the testbed and downsampled to 144 kHz before converting to a spectrogram. The flow rate of the leak was set to 4 L/min, which corresponds to a small-scale leak [39], by loosening the flange in the testbed for collecting leak-state AE signals. The parameters used for the STFT to convert the AE signals into spectrograms are shown in Table 4.
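The rate change from 200 kHz to 144 kHz corresponds to a rational resampling factor, which the short sketch below works out with the standard library. The resampling algorithm itself is not stated in the text; a polyphase resampler such as `scipy.signal.resample_poly(x, up, down)` is one option that takes these two integers, and naming it here is an assumption rather than a description of the authors' pipeline.

```python
from fractions import Fraction

fs_in = 200_000    # acquisition rate in Hz
fs_out = 144_000   # rate used for the spectrograms in Hz
duration_s = 1     # length of each recording in seconds

ratio = Fraction(fs_out, fs_in)              # reduces to up/down = 18/25
up, down = ratio.numerator, ratio.denominator
n_in = fs_in * duration_s                    # samples per recording before resampling
n_out = n_in * up // down                    # samples per recording after resampling
nyquist_out = fs_out / 2                     # highest representable frequency in Hz
```

After downsampling, each 1 s recording holds 144,000 samples and the spectrogram covers frequencies up to 72 kHz.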

5.1. VAE Model Training Using Differences-of-Spectrogram Images

The VAE model was implemented in PyTorch (2.7.1) and trained on a computer with an NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA).

Before training the VAE, DoS images were generated using normal-state spectrograms and leak-state spectrograms. A total of 2300 DoS images were created from 1150 normal-state spectrograms and two leak-state spectrograms.

The VAE model was trained for 500 epochs. The AdamW optimizer was used with a learning rate of 0.0001 and a batch size of 32. The weighting factor $\beta$ in the loss function was set to 0.001 to balance the reconstruction loss and the KLD. The input size of the VAE model was set to 256 × 128 by resizing the DoS images.
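The reported settings can be collected into a small configuration, from which the number of optimizer updates follows. The dictionary keys are hypothetical names, and the calculation assumes that all 2300 DoS images are used for training and that the final partial batch of each epoch is kept; neither detail is stated in the text.

```python
import math

config = {
    "epochs": 500,
    "batch_size": 32,
    "learning_rate": 1e-4,
    "beta": 1e-3,
    "input_size": (256, 128),
    "n_dos_images": 1150 * 2,   # normal-state x leak-state pairs
}

# Number of mini-batches per epoch, counting the last partial batch.
steps_per_epoch = math.ceil(config["n_dos_images"] / config["batch_size"])
total_steps = steps_per_epoch * config["epochs"]
```

Under these assumptions, one epoch takes 72 updates and the full run takes 36,000 updates.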

Samples of the generated DoS images are shown in Figure 9. The generated DoS images show patterns similar to those of the real DoS images in Figure 5. This suggests that the VAE model has learned the distribution of the real DoS images and that its decoder can generate DoS images that follow the learned distribution.

5.2. Leak Detection Models

The decoder of the trained VAE model was used to generate synthetic DoS images. Synthetic leak-state spectrograms were generated by adding the generated DoS images to normal-state spectrograms sampled from a statistical model defined by the mean and variance of the normal-state spectrograms. Figure 10 shows examples of the generated synthetic leak-state spectrograms. This figure demonstrates that the proposed method can generate diverse synthetic leak-state spectrograms by combining different normal-state spectrograms with generated DoS images.
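One plausible reading of "a statistical model defined by the mean and variance" is an independent Gaussian per time-frequency bin. The sketch below follows that reading, which is an assumption: the text does not specify the distribution, and the clipping at zero and the gamma-distributed stand-in data are likewise choices made for illustration.

```python
import numpy as np

def fit_normal_model(normal_specs):
    """Per-bin mean and variance of the normal-state spectrograms."""
    return normal_specs.mean(axis=0), normal_specs.var(axis=0)

def sample_normal_specs(mean, var, n, rng):
    """Draw spectrograms from independent per-bin Gaussians, clipped at zero."""
    samples = mean + np.sqrt(var) * rng.standard_normal((n, *mean.shape))
    return np.clip(samples, 0.0, None)       # power spectrograms are non-negative

rng = np.random.default_rng(0)
# Stand-in normal-state spectrograms (shape and distribution are hypothetical).
normal_specs = rng.gamma(shape=4.0, scale=1.0, size=(200, 16, 8))
mean, var = fit_normal_model(normal_specs)
sampled = sample_normal_specs(mean, var, n=5000, rng=rng)
```

Sampling from the fitted model, rather than reusing the recorded spectrograms directly, lets each synthetic leak-state spectrogram start from a slightly different normal-state background.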

The magnified view of real and synthetic leak-state spectrograms (Figure 11) shows that the synthetic leak-state spectrograms have similar characteristics to the real leak-state spectrograms. Both the real leak spectrograms on the left and the synthetic leak spectrograms on the right show multiple high-energy responses resembling broken bands in the mid-to-low frequency range, along with weak energy responses appearing as noise-like patterns around them. This similarity indicates that the synthetic leak-state spectrograms effectively mimic the distinctive characteristics of the real leak-state spectrograms.

Two hundred normal-state spectrograms and two real leak-state spectrograms were used to train the leak detection model. To mitigate data imbalance, 198 synthetic leak-state spectrograms were generated with each of four methods: the proposed method, random oversampling [27], SMOTE [29], and ADASYN [30]. For each method, the generated spectrograms were combined with the two real leak-state spectrograms to create a balanced training dataset. The unbalanced dataset without oversampling was also used as a baseline against which the proposed method and the other oversampling methods were compared. Four different leak detection models were implemented: (1) logistic regression, (2) decision tree [40], (3) support vector machine (SVM) [41], and (4) gradient boosting [42]. One thousand normal-state spectrograms and 1000 leak-state spectrograms were used as the test dataset to compare the performance of the trained leak detection models. Because spectrograms are high-dimensional (256 × 128), they were flattened into one-dimensional vectors before training, and the dimensionality was reduced to 128 using principal component analysis (PCA) [43].
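The flatten-then-PCA preprocessing can be sketched with a plain SVD. This is an illustrative sketch, not the authors' code: the random stand-in data and the reduced 32 × 16 spectrogram size are assumptions made to keep the example fast, and the helper names are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the training mean and the top principal directions (via SVD)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(X, mean, components):
    """Project centered data onto the principal directions."""
    return (X - mean) @ components.T

rng = np.random.default_rng(0)
# Stand-in data: 400 training spectrograms (200 normal + 200 leak), downsized to
# 32 x 16 here to keep the example fast; the paper uses 256 x 128.
specs = rng.random((400, 32, 16))
X = specs.reshape(len(specs), -1)            # flatten to one vector per spectrogram
mean, components = fit_pca(X, n_components=128)
Z = apply_pca(X, mean, components)
```

The mean and components should be fitted on the training set only and then reused unchanged for the test set through `apply_pca`, so that no test information leaks into the projection.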

The performance of leak detection models trained using the proposed method and other oversampling methods is compared in Table 5. The table shows accuracies and F1-scores of leak detection models trained using naive (no oversampling), random oversampling, SMOTE, ADASYN, and the proposed method. The proposed method outperforms the other oversampling methods across all four leak detection models. Both the accuracies and F1-scores are higher when the models are trained using the proposed method. When trained using other oversampling methods, the logistic regression model achieves higher accuracy and F1-score than the other models, but its performance is still lower than that of the models trained using the proposed method. In contrast, decision tree, SVM, and gradient boosting show poor performance when trained using other oversampling methods. However, these models show considerable performance improvements when trained using the proposed method.
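For reference, accuracy and F1-score follow directly from the confusion-matrix counts. The sketch below treats the leak state as the positive class; the counts are hypothetical and are not taken from Table 5.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 with the leak state as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts on a 1000 + 1000 test set (not taken from Table 5):
# a model that labels almost everything as normal.
acc, prec, rec, f1 = binary_metrics(tp=50, fp=0, fn=950, tn=1000)
```

In this example the accuracy stays above 50% on the balanced test set while the F1-score drops below 0.1, which is why the F1-score is the more informative figure for a model biased towards the normal state.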

The performances of SMOTE and ADASYN are similar to that of random oversampling. Both methods generate synthetic data by interpolating between data points in the minority class, and only two leak-state spectrograms are available for training. The diversity of the synthetic samples is therefore tightly constrained, and more minority-class samples than in the current scenario would be required to generate more diverse synthetic samples. As a result, the performance improvements achieved by SMOTE and ADASYN are similar to those of random oversampling, which simply duplicates existing minority-class samples without introducing new variations.
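The constraint can be made explicit for the two-sample case. Writing the two real leak-state feature vectors as $x_a$ and $x_b$ (symbols introduced here for illustration only), an interpolation-based sample takes the form

$$\tilde{x} = x_a + \lambda\,(x_b - x_a), \qquad \lambda \in [0, 1].$$

Every such sample lies on the line segment joining $x_a$ and $x_b$, so the synthetic set varies along the single direction $x_b - x_a$, regardless of how many samples are generated.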

In contrast, the leak detection models trained using the proposed method show improved performance compared to those trained with widely used oversampling methods such as SMOTE and ADASYN. Rather than interpolating between existing samples in the feature space, the proposed method generates more diverse synthetic spectrograms by learning a latent distribution of DoS images using a VAE.

Figure 12 shows the confusion matrices of leak detection models trained using SMOTE and the proposed method. The logistic regression model trained with SMOTE incorrectly classifies some normal-state spectrograms as leak-state spectrograms. The decision tree, SVM, and gradient boosting models trained with SMOTE perform poorly, misclassifying most leak-state spectrograms as normal-state spectrograms. Meanwhile, all four leak detection models trained with the proposed method show substantially fewer misclassifications for both normal- and leak-state spectrograms. This supports the claim that the proposed method can effectively mitigate the data imbalance by generating more diverse synthetic leak-state spectrograms than other oversampling methods.

6. Conclusions and Future Work

This paper proposes a method to generate synthetic leak-state spectrograms using a VAE model for mitigating data imbalance in leak detection. The proposed method generates DoS images using the VAE model trained on a few leak-state spectrograms and many normal-state spectrograms. The decoder of the trained VAE model generates DoS images from randomly sampled latent variables. The synthetic leak-state spectrograms are generated by combining the generated DoS images with normal-state spectrograms. The proposed method effectively mitigates data imbalance by generating synthetic leak-state spectrograms using only a small number of leak-state spectrograms. Its effectiveness is evaluated by comparing the performance of various leak detection models trained using different oversampling methods.

This work presents a preliminary result on resolving the data imbalance issue using generative models, and it can be extended in several directions. The generative model can be improved to generate higher-quality leak-state spectrograms that more closely resemble real leak-state spectrograms. A smaller generative model can be developed to synthesize leak-state data by using an alternative representation instead of a spectrogram. The degree of leakage can be considered as an additional condition for generating leak-state spectrograms. AE sensors can be attached not only to pipes but also to other manufacturing equipment to detect various types of faults. The application of the proposed method can therefore be extended to other fault detection tasks in manufacturing facilities that suffer from data imbalance issues.

Author Contributions

Conceptualization, B.P. and H.Y.; methodology, B.P. and H.R.; software, B.P. and H.R.; validation, H.R. and H.Y.; investigation, H.Y.; resources, H.Y. and B.P.; data curation, H.R.; writing—original draft preparation, B.P. and H.R.; writing—review and editing, H.Y.; visualization, B.P. and H.R.; supervision, H.R. and H.Y.; project administration, B.P.; funding acquisition, B.P. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (No. NRF-2022R1C1C1010931) and the project for Collabo R&D between Industry, University, and Research Institute funded by Korea Ministry of SMEs and Startups in 2025 (RS-2025-02310175).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

During the preparation of this work the authors used ChatGPT 5.2 and DeepL Translator in order to improve the readability and language of the manuscript. After using these tools/services, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AE	Acoustic Emission
VAE	Variational Autoencoder
SVM	Support Vector Machine
ABC	Artificial Bee Colony
STFT	Short-Time Fourier Transform
IMF	Intrinsic Mode Function
EMD	Empirical Mode Decomposition
CNN	Convolutional Neural Network
LSTM	Long Short-Term Memory
CWT	Continuous Wavelet Transform
SMOTE	Synthetic Minority Oversampling Technique
ADASYN	Adaptive Synthetic Sampling
DFT	Discrete Fourier Transform
DoS	Difference of Spectrograms
KLD	Kullback–Leibler Divergence
PCA	Principal Component Analysis

Figure 1. A testbed using a 35 mm galvanized pipe to simulate leaks. The magenta box indicates the leak location, and the cyan box indicates the location of the flexible acoustic emission sensor. Nitrogen gas is injected at one end of the pipe, and a gas leak is simulated by loosening the flange at the opposite end.

Figure 2. Acoustic emission signals in normal and leak states. (a) Normal-state AE signal. (b) Leak-state AE signal.

Figure 3. Normal- and leak-state spectrograms of AE signals collected from the testbed. The normal-state spectrograms show a small number of dominant frequency components. In contrast, the leak-state spectrograms show the energy of the dominant frequency components dispersed and the noise increased across a broad frequency range. (a) Normal-state spectrograms. (b) Leak-state spectrograms.

Figure 4. Generation of differences-of-spectrogram images.

Figure 5. Examples of differences-of-spectrogram images generated by subtracting normal-state spectrograms from leak-state spectrograms. The dispersed energy of the dominant frequency components and increased noise across a broad frequency range are observed in the DoS images.

Figure 6. Training of variational autoencoder using differences-of-spectrogram images.

Figure 7. Generation of synthetic leak-state spectrograms by using the decoder of the trained VAE model, which generates synthetic differences of spectrograms from random latent variables.

Figure 8. Training of leak detection model using real and synthetic spectrograms.

Figure 9. Samples of generated differences-of-spectrogram images using the decoder of the trained VAE model. The characteristics of the generated DoS images are similar to those of the real DoS images. They also show similar patterns to the real leak-state spectrograms.

Figure 10. Samples of synthetic leak-state spectrograms generated by combining normal-state spectrograms with generated differences of spectrograms.

Figure 11. Magnified view of real and synthetic leak-state spectrograms to compare their characteristics. The synthetic leak-state spectrograms have similar energy response patterns to the real leak-state spectrograms.

Figure 12. Confusion matrices of leak detection models trained using synthetic leak-state spectrograms generated by (left) SMOTE and (right) the proposed method: (a) logistic regression, (b) decision tree, (c) SVM, and (d) gradient boosting.
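
The subtraction and addition steps described in the captions of Figures 4, 5, 7, and 10 can be illustrated with a minimal sketch. It assumes that spectrograms are same-shaped arrays (here plain nested lists) and that a leak-state spectrogram is paired with a normal-state spectrogram element by element; the function names and the toy values are hypothetical and not taken from the article. In the proposed method the added difference comes from the VAE decoder, whereas this sketch reuses a real difference only to show that the two operations are inverse to each other.

```python
# Hypothetical helpers: spectrograms are nested lists indexed [frequency][time].

def difference_of_spectrograms(leak, normal):
    """Return the element-wise difference leak - normal (a DoS image)."""
    return [[l - n for l, n in zip(leak_row, normal_row)]
            for leak_row, normal_row in zip(leak, normal)]


def synthesize_leak(normal, dos):
    """Return normal + dos, i.e. a synthetic leak-state spectrogram."""
    return [[n + d for n, d in zip(normal_row, dos_row)]
            for normal_row, dos_row in zip(normal, dos)]


# Toy 2 x 2 spectrograms (made-up values, for illustration only).
normal = [[1.0, 2.0], [0.5, 0.25]]
leak = [[1.5, 2.5], [1.0, 1.25]]

dos = difference_of_spectrograms(leak, normal)
reconstructed = synthesize_leak(normal, dos)
```

Adding a difference back to the normal-state spectrogram it was computed from recovers the original leak-state spectrogram; adding a decoder-generated difference instead yields a new, synthetic one.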

Table 1. Architecture of the encoder.

Layer Name	Kernel	Output Shape
conv2d	3 × 3, 512, stride 2	128 × 64
conv2d	3 × 3, 256, stride 2	64 × 32
conv2d	3 × 3, 128, stride 2	32 × 16
conv2d	3 × 3, 64, stride 2	16 × 8
conv2d	3 × 3, 32, stride 2	8 × 4
linear	-	1024
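
The output shapes in Table 1 can be checked with simple shape arithmetic. The sketch below assumes a 256 × 128 input (inferred from the 128 × 64 output of the first stride-2 layer and the 256 × 128 decoder output in Table 3) and a padding of 1, which is not stated in the table; the helper names are hypothetical.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial output size of a convolution along one axis.

    padding=1 is an assumption; Table 1 lists only kernel and stride.
    """
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(height, width, channels):
    """Return (channels, height, width) after each stride-2 conv layer."""
    shapes = []
    for c in channels:
        height, width = conv_out(height), conv_out(width)
        shapes.append((c, height, width))
    return shapes


# Channel counts as listed in Table 1; the 256 x 128 input is an assumption.
shapes = encoder_shapes(256, 128, [512, 256, 128, 64, 32])
last_channels, last_height, last_width = shapes[-1]
flattened = last_channels * last_height * last_width
```

Under these assumptions each layer halves both spatial dimensions, and the flattened size of the last feature map (32 × 8 × 4) equals the 1024 listed for the linear layer.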

Table 2. Architecture of the latent space module.

Layer Name	Kernel	Output Shape
linear (mean)	-	128
linear (logvar)	-	128
reparameterization	-	128
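
The reparameterization step in Table 2 is commonly written as z = mean + exp(0.5 · logvar) · eps with eps drawn from a standard normal distribution. The following is a minimal sketch of that standard formulation using only the Python standard library; the function name, the seed, and the zero-valued example inputs are hypothetical, and the article's own implementation may differ.

```python
import math
import random

LATENT_DIM = 128  # latent size listed in Table 2


def reparameterize(mean, logvar, rng):
    """Sample z = mean + std * eps, where std = exp(0.5 * logvar)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, logvar)]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable

# During training, mean and logvar would come from the two linear layers.
mean = [0.0] * LATENT_DIM
logvar = [0.0] * LATENT_DIM
z = reparameterize(mean, logvar, rng)

# During generation, a latent vector is drawn directly from the prior N(0, I).
z_prior = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
```

Writing the sample as a deterministic function of mean, logvar, and an independent noise term is what allows gradients to flow back to the encoder during training.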

Table 3. Architecture of the decoder.

Layer Name	Kernel	Output Shape
decoder input	-	1024
pixel shuffle	3 × 3, 64, stride 2, upscale 2	16 × 8
pixel shuffle	3 × 3, 128, stride 2, upscale 2	32 × 16
pixel shuffle	3 × 3, 256, stride 2, upscale 2	64 × 32
pixel shuffle	3 × 3, 512, stride 2, upscale 2	128 × 64
interpolation	bilinear, upscale 2	256 × 128
conv2d	3 × 3, 1, stride 1	256 × 128
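
The pixel shuffle layers in Table 3 double the spatial resolution by rearranging channels into space. The sketch below implements the standard rearrangement (a tensor of shape C·r² × H × W becomes C × rH × rW) on nested lists; the function name is hypothetical, and the exact layer composition used in the article is not specified beyond what the table lists.

```python
def pixel_shuffle(x, r):
    """Rearrange x[C*r*r][H][W] into out[C][H*r][W*r] (sub-pixel upscaling)."""
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    channels_out = channels_in // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):          # vertical offset inside each r x r block
            for j in range(r):      # horizontal offset inside each r x r block
                src = x[c * r * r + i * r + j]
                for row in range(height):
                    for col in range(width):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1 x 1 channels become one 2 x 2 channel when r = 2.
x = [[[float(k)]] for k in range(4)]
y = pixel_shuffle(x, 2)
```

With an upscale factor of 2, every group of four input channels fills one 2 × 2 block of the output, which is consistent with the doubling of the output shape from row to row in Table 3.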

Table 4. STFT parameters for converting AE signals into spectrograms.

Parameter	Value
Frame length	1024
Hop size	512
Window type	Hann window
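
To make the parameters in Table 4 concrete, the sketch below computes a magnitude spectrogram with a frame length of 1024, a hop size of 512, and a Hann window, using only the Python standard library. The periodic form of the Hann window, the absence of edge padding, the test signal, and all function names are assumptions made for illustration; the article does not state these details.

```python
import cmath
import math

FRAME_LENGTH = 1024  # Table 4
HOP_SIZE = 512       # Table 4


def hann(n):
    """Periodic Hann window of length n (the periodic form is an assumption)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def stft_magnitude(signal, frame_length=FRAME_LENGTH, hop=HOP_SIZE):
    """Return magnitudes indexed [frame][bin], without edge padding."""
    window = hann(frame_length)
    frames = []
    for start in range(0, len(signal) - frame_length + 1, hop):
        segment = signal[start:start + frame_length]
        spectrum = fft([s * w for s, w in zip(segment, window)])
        # Keep the non-negative frequencies only: frame_length // 2 + 1 bins.
        frames.append([abs(v) for v in spectrum[:frame_length // 2 + 1]])
    return frames


# Made-up test tone with exactly 64 cycles per frame.
signal = [math.sin(2.0 * math.pi * 64 * n / FRAME_LENGTH) for n in range(4096)]
spec = stft_magnitude(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

With these settings, consecutive frames overlap by 50%, and each frame yields 513 frequency bins.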

Table 5. Leak detection performance comparison under different oversampling methods when the input vector dimension is 128. The “Baseline” column indicates the performance obtained using the original dataset without oversampling. For each model, the first row reports accuracy and the second row reports the F1-score.

Model	Metric	Baseline	Random	SMOTE	ADASYN	Proposed
Logistic Regression	Accuracy	0.901	0.890	0.890	0.889	0.990
Logistic Regression	F1-score	0.901	0.901	0.901	0.900	0.990
Decision Tree	Accuracy	0.518	0.519	0.519	0.519	0.973
Decision Tree	F1-score	0.037	0.071	0.071	0.071	0.972
SVM	Accuracy	0.500	0.585	0.629	0.630	0.926
SVM	F1-score	0.000	0.289	0.409	0.413	0.920
Gradient Boosting	Accuracy	0.518	0.543	0.543	0.543	0.973
Gradient Boosting	F1-score	0.037	0.157	0.157	0.157	0.972
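
The two metrics reported in Table 5 can be computed from confusion-matrix counts as in the sketch below. It treats the leak state as the positive class, which is an assumption, and the counts are hypothetical: the test-set composition is not given in this section. The example shows one situation that would produce an accuracy of 0.500 together with an F1-score of 0.000, namely a balanced test set on which every sample is predicted as normal.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) from confusion-matrix counts.

    The positive class is taken to be the leak state (an assumption).
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    denominator = 2 * tp + fp + fn
    # F1 is defined as 0 here when there are no positives at all.
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical counts: 100 leak and 100 normal samples, all predicted normal.
acc, f1 = accuracy_and_f1(tp=0, fp=0, fn=100, tn=100)
```

Because F1 ignores true negatives, a model biased towards the normal state can keep a moderate accuracy while its F1-score collapses, which is why both values are useful when classes are imbalanced.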

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

