Review

A Review on Gas Pipeline Leak Detection: Acoustic-Based, OGI-Based, and Multimodal Fusion Methods

State Key Laboratory of Advanced Nuclear Energy Technology, Nuclear Power Institute of China, Chengdu 610213, China
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 731; https://doi.org/10.3390/info16090731
Submission received: 30 June 2025 / Revised: 17 August 2025 / Accepted: 18 August 2025 / Published: 25 August 2025

Abstract

Pipelines play a vital role in material transportation within industrial settings. This review synthesizes detection technologies for early-stage small gas leaks from pipelines in the industrial sector, with a focus on acoustic-based methods, optical gas imaging (OGI), and multimodal fusion approaches. It encompasses detection principles, inherent challenges, mitigation strategies, and the state of the art (SOTA). Small leaks refer to low flow leakage originating from defects with apertures at millimeter or submillimeter scales, posing significant detection difficulties. Acoustic detection leverages the acoustic wave signals generated by gas leaks for non-contact monitoring, offering advantages such as rapid response and broad coverage. However, its susceptibility to environmental noise interference often triggers false alarms. This limitation can be mitigated through time-frequency analysis, multi-sensor fusion, and deep-learning algorithms—effectively enhancing leak signals, suppressing background noise, and thereby improving the system’s detection robustness and accuracy. OGI utilizes infrared imaging technology to visualize leakage gas and is applicable to the detection of various polar gases. Its primary limitations include low image resolution, low contrast, and interference from complex backgrounds. Mitigation techniques involve background subtraction, optical flow estimation, fully convolutional neural networks (FCNNs), and vision transformers (ViTs), which enhance image contrast and extract multi-scale features to boost detection precision. Multimodal fusion technology integrates data from diverse sensors, such as acoustic and optical devices. Key challenges lie in achieving spatiotemporal synchronization across multiple sensors and effectively fusing heterogeneous data streams. Current methodologies primarily utilize decision-level fusion and feature-level fusion techniques. Decision-level fusion offers high flexibility and ease of implementation but lacks inter-feature interaction; it is less effective than feature-level fusion when correlations exist between heterogeneous features. Feature-level fusion amalgamates data from different modalities during the feature extraction phase, generating a unified cross-modal representation that effectively resolves inter-modal heterogeneity. In conclusion, we posit that multimodal fusion holds significant potential for further enhancing detection accuracy beyond the capabilities of existing single-modality technologies and is poised to become a major focus of future research in this domain.

1. Introduction

Gas pipeline leaks pose significant risks in industrial settings, such as underground urban networks, chemical plants, and nuclear power plants (NPPs). Common causes include material fatigue, welding defects, corrosion, valve seal failures, and geological hazards [1,2]. When such leaks occur, the consequences are often severe: grave safety hazards, substantial economic losses, and extensive, persistent environmental damage. On 15 January 2023, an explosion at the Panjin Haoye Chemical Co. plant in Panjin, Liaoning Province, China, caused 13 deaths and 35 injuries, as well as approximately CNY 87.99 million in direct economic losses [3]. A January 2025 natural gas pipeline leak at a Petronas station in Puchong triggered a deflagration with flames towering hundreds of meters high; the incident resulted in 63 injuries, the destruction of multiple structures and vehicles, and the direct release of substantial quantities of methane into the atmosphere [4]. Leak-before-break (LBB) is a design principle for NPP pipelines that prioritizes early leak detection to prevent catastrophic failures [5]. Consequently, efficient and precise detection of leaks within industrial pipelines, particularly early-stage minor leakage and incipient breaches, is a critical challenge in preventing major safety and environmental incidents and avoiding substantial economic losses, and it carries paramount practical significance.
One prevalent approach for gas pipeline leak detection employs numerical simulation or sensor data based on physical models. Diffusion models predict the spatial distribution and concentration trends of leaked gas by establishing atmospheric diffusion equations that incorporate environmental parameters, such as wind speed and temperature gradients. This method is instructive for accident consequence assessment and emergency response planning, but its accuracy relies heavily on real-time environmental data acquisition, and its high computational complexity makes it difficult to meet rapid-response requirements [6]. Two-phase flow analysis is applicable to gas–liquid mixed leakage scenarios in pipelines or containers. By simulating complex fluid dynamics processes, including interphase forces and interfacial tension, this approach enables a precise analysis of leakage rates and morphological evolution. However, it requires high-precision multiphase flow models, which are computationally demanding and struggle to handle the transient processes of sudden leaks [7].
Another method utilizes sensor detection technology. Acoustic detection achieves non-contact monitoring by capturing specific frequency acoustic waves generated during gas ejection, which offers advantages such as rapid response and broad coverage, being particularly suitable for micro-leakage detection in high-pressure pipelines. Yet, its reliability is compromised by environmental noise interference and frequent false alarms in industrially noisy environments [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]. Optical detection techniques such as optical gas imaging (OGI) leverage the absorption characteristics of gas molecules at specific light wavelengths to enable visual localization and quantitative analysis. OGI can achieve sensitivity to the ppm level for greenhouse gases like methane, with strong resistance to non-overlapping background gas interference when operating in optimized spectral bands [25,26,27,28,29,30,31,32,33,34,35,36,37,38,39]. However, it is vulnerable to thermal radiation interference and entails relatively high equipment costs when choosing cooled OGI [34]. Multimodal fusion technology enhances robustness and accuracy by integrating multi-source data from acoustic, optical, and chemical sensors, coupled with machine learning for feature extraction and decision optimization. Nevertheless, such systems require resolution of technical challenges, including spatiotemporal synchronization of multiple sensors and heterogeneous data fusion [21,40,41,42,43,44,45,46,47,48,49,50,51,52,53].
This paper mainly surveys the literature of the past decade on gas leakage detection in surface and subsurface pipelines (excluding underwater pipelines) within industrial settings, particularly natural gas and liquefied natural gas (LNG) pipelines in chemical plants, as well as water intake and steam pipelines in nuclear power plants. The detection methods investigated encompass fusion approaches (multi-sensor, multi-domain, and multimodal) based on acoustic sensing and OGI.
The primary contributions of this paper are summarized as follows:
(1)
It provides a comprehensive review of gas leak detection methodologies based on acoustic sensing, OGI, and multimodal fusion, along with a comparative analysis between the conventional algorithms and SOTA approaches;
(2)
It synthesizes critical datasets for infrared visual recognition methods and evaluates the performance metrics of the SOTA models in this domain;
(3)
It elucidates the comparative advantages and limitations of feature-level versus decision-level multimodal fusion, analyzes the underlying causes for the absence of graph neural network (GNN) integration in current frameworks, and proposes future research trajectories for multimodal fusion technologies.
The remaining sections are organized as follows: Section 2 introduces acoustic-based detection systems, Section 3 elaborates on OGI-based detection systems, Section 4 discusses integrated detection systems, and Section 5 and Section 6 present discussion and conclusions along with future prospects.

2. Acoustic-Based Leakage Detection System

2.1. Technical Framework and Implementation

2.1.1. Principle and Classification of Acoustic Detection

Acoustic waves are mechanical vibrations that propagate through a medium, such as air, water, or solids. They propagate better in liquids and solids than in gases and cannot propagate in a vacuum [54]. Acoustic impedance, defined as the product of the medium’s density and the speed of sound, determines the reflection and transmission of sound waves at the interface between two media. When the acoustic impedance difference between two media is large, more reflection occurs at the interface, leading to energy loss and degrading sensor detection performance. The acoustic impedance of air is very low, approximately 0.00042 MRayl, while that of solids, such as gold, can be as high as 63.8 MRayl. Water has an intermediate acoustic impedance of around 1.5 MRayl [55]. Detection equipment must be matched to the acoustic impedance of the medium to minimize energy loss. For example, hydrophones are used for underwater pipeline detection [56], microphones and ultrasonic sensors are used for pipelines exposed to air [8,57], and accelerometers and acoustic emission sensors are used for solid pipelines [58,59].
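As a quick check of the figures above, the following sketch computes the characteristic impedance Z = ρc from typical handbook densities and sound speeds; the exact values are assumptions and vary with temperature and material grade.

```python
def impedance_mrayl(density_kg_m3: float, speed_m_s: float) -> float:
    """Characteristic acoustic impedance Z = rho * c, returned in MRayl (1e6 Pa*s/m)."""
    return density_kg_m3 * speed_m_s / 1e6

# Typical handbook values (assumptions): density [kg/m^3], sound speed [m/s].
print(f"air:   {impedance_mrayl(1.2, 343):.5f} MRayl")     # ~0.00041 MRayl
print(f"water: {impedance_mrayl(1000, 1480):.2f} MRayl")   # ~1.48 MRayl
print(f"gold:  {impedance_mrayl(19300, 3240):.1f} MRayl")  # ~62.5 MRayl
```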
The acoustic spectrum can be divided into infrasound, audible sound, and ultrasound by frequency. The infrasound frequency band is below 20 Hz, produced by low-frequency vibrations in pipelines due to fluid turbulence. This band can propagate for several kilometers and is suitable for coarse localization in long-distance pipelines, though it has low sensitivity to minor leaks [60]. The audible sound band ranges from 20 Hz to 20 kHz, with typical leak-related frequencies occurring between 100 Hz and 5 kHz. This band captures noise from leak ejection and pipeline structural resonance. By identifying characteristic peaks through spectral analysis, it serves for medium-to-short-distance detection [61]. Frequencies above 20 kHz belong to the ultrasonic range, where high-speed fluid shear generates components between 30 kHz and 100 kHz. This band offers a high signal-to-noise ratio (SNR), along with rapid attenuation, enabling precise localization of minor leaks in short-distance pipelines [62].

2.1.2. Acoustic Sensors

Acoustic sensors can be classified into passive and active modes [63]. Passive detection does not require a sound wave generation device; it relies solely on receiving devices to capture the acoustic signals spontaneously generated during pipeline leaks. Common passive sensors include microphones and acoustic receivers. The advantages of passive detection are its non-intrusiveness to pipeline operation and strong real-time capability, making it well-suited for detecting sudden leaks. However, its detection sensitivity is insufficient for small leaks or under conditions of low SNR. Active detection locates the leakage source by emitting sound waves and analyzing their reflection, scattering, and attenuation characteristics. Its core component is the transducer, which has both transmitting and receiving functions.
Notably, while the sensors themselves output analog signals, digital measurement with either active or passive sensors requires that the data acquisition system, specifically the analog-to-digital converter (ADC), satisfy the Nyquist sampling theorem: the sampling frequency must exceed twice the highest frequency component of interest so that the signal can be reconstructed without aliasing.
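The sampling-rate requirement can be expressed as a one-line check, as in the minimal sketch below; the 2.5× margin is a common engineering choice assumed here, not a value taken from the cited works.

```python
def min_sampling_rate(f_max_hz: float, margin: float = 2.5) -> float:
    """Nyquist requires fs > 2 * f_max; a practical margin (here 2.5x) is assumed."""
    return margin * f_max_hz

# Leak-related bands quoted in Section 2.1.1.
for band, f_max in [("audible leak band", 5_000.0), ("ultrasonic leak band", 100_000.0)]:
    print(f"{band}: f_max = {f_max / 1e3:.0f} kHz -> sample at >= {min_sampling_rate(f_max) / 1e3:.0f} kHz")
```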

2.1.3. Acoustic Detection Process

Acoustic detection technology uses sensors to capture raw signals, analyzes them, and produces computational results. Its core workflow comprises three key stages: signal acquisition, feature engineering, and pattern recognition.
Signal acquisition: The system collects raw acoustic signals from target objects using broadband microphones, piezoelectric sensors, etc.;
Feature engineering: Time and frequency domain analysis are typically employed to extract features;
Pattern recognition: Leak detection is ultimately achieved through energy threshold detection, machine learning, or end-to-end recognition via deep learning.
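The following sketch illustrates this three-stage workflow in its simplest form, with a synthetic signal standing in for acquisition, spectrogram band energy as the extracted feature, and a plain energy threshold as the recognizer; all parameter values are assumptions for illustration only.

```python
import numpy as np
from scipy import signal

# Stage 1 - signal acquisition (replaced here by a synthetic recording).
fs = 50_000                                 # assumed ADC sampling rate [Hz]
t = np.arange(0, 1.0, 1 / fs)
x = 0.05 * np.random.randn(t.size)          # stand-in for an acquired acoustic signal

# Stage 2 - feature engineering: band energy in the 100 Hz - 5 kHz leak band.
f, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=1024)
band = (f >= 100) & (f <= 5_000)
band_energy = Sxx[band].sum(axis=0)

# Stage 3 - pattern recognition: the simplest possible rule, an energy threshold.
threshold = 5.0 * np.median(band_energy)    # assumed calibration factor
leak_frames = band_energy > threshold
print(f"frames flagged as leak-like: {int(leak_frames.sum())} / {leak_frames.size}")
```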

2.1.4. Key Challenges and Trade-Offs

The key challenges for acoustic detection are as follows:
Micro-leakage or small-leakage detection: Acoustic signals generated by leaks are susceptible to masking by ambient noise, potentially leading to false positives or missed detections [63]. Furthermore, small leaks may not induce significant pressure drops in the system [9], making timely identification difficult. According to acoustic detection principles, a smaller leakage orifice diameter produces higher-frequency signals with shorter detectable distances. Concurrently, higher-frequency signals necessitate increased sampling rates (as dictated by the Nyquist–Shannon theorem), thereby elevating computational requirements [23,60,61,62];
Signal distortion and attenuation: Acoustic waves reflecting off the surface of the pipeline can cause signal distortion, affecting the accurate determination of the leak location. The acoustic signals generated by leaks exhibit non-stationary characteristics in both time and frequency domains, meaning that the amplitude, frequency, and phase of the signals may vary over time. This non-stationarity makes it challenging for conventional signal-processing methods to effectively identify and analyze leak signals, especially under low SNR conditions, where the characteristics of the leak signals are not prominent. Noise reduction algorithms must balance between noise suppression and maintaining the integrity of the target signal [12,13]. Additionally, high-frequency components of signals experience significant attenuation during propagation through pipelines, resulting in weak correlation of leakage signals across a wide frequency band [24].
Research on acoustic signal processing concentrates on tackling the aforementioned challenges. The subsequent sections delineate the prevalent acoustic detection methodologies, with particular emphasis on spectral analysis and acoustic signal visualization. Due to the non-stationary characteristics inherent in acoustic signals generated by gas pipeline leaks, Table 1 summarizes commonly employed processing techniques, comparing their differences in temporal resolution, computational efficiency, and noise robustness.

2.2. Spectral Analysis

Spectral analysis serves as a fundamental technique for converting time-domain signals into frequency-domain representations. It quantifies the energy distribution across different frequency components, enabling (1) the identification of dominant frequency constituents, (2) the characterization of energy distribution patterns across frequency bands, and (3) the separation of noise-corrupted bands from valid signals.
Spectral analysis has increasingly been combined with AI algorithms in recent years, serving as a data preprocessing method for feature extraction. AI algorithms possess strong recognition capabilities, enabling better capture of feature information, especially end-to-end neural networks, which are more adept at capturing subtle feature changes. Lu et al. [10] proposed a natural gas pipeline leak feature extraction method based on variational mode decomposition (VMD) and locally linear embedding (LLE). Initially, VMD is used for preliminary feature extraction of pipeline signals, selecting feature modes through dispersed entropy. Subsequently, the LLE algorithm is employed to reduce the dimensionality of the high-dimensional feature matrix, retaining sensitive feature vectors. Finally, classification and recognition are performed using Fisher measurement and support vector machines (SVM). This method achieves a classification accuracy of 95% in pipeline leak detection. Ning et al. [64] introduced a novel architecture (SE-CNN) that combines spectral enhancement (SE) and a CNN based on VGGNet [67], designed to detect natural gas pipeline leaks by addressing dynamic background noise removal. The spectral enhancement method enhances static leak signals through STFT and convolution operations while eliminating non-static noise signals. Spectral enhancement can also be viewed as a data compression method, significantly reducing the size of the original signal, which helps accelerate the training process of neural networks and lower computational costs. Experimental results show that SE-CNN achieves an average accuracy of 94.3% in weak noise environments and 89.1% in strong noise environments, significantly higher than traditional SVM and CNN methods. Yao et al. [13] proposed a dual-feature drift framework for acoustic signal leak detection using a non-parametric design method. By performing feature reverse normalization, the time-cycle features of the signal feature matrix are normalized exponentially, eliminating strong background noise. A feature drift layer is constructed in a one-dimensional convolutional neural network to enhance gradient constraints and eliminate data distribution differences during the model training process. This method achieves a fault recognition accuracy of 95.46% in natural gas pipeline leak detection. Han et al. [14] introduced a wavelet–RBFN method that combines wavelet transform and radial basis function networks to leverage multi-source acoustic signals and reduce false detection rates. The wavelet–RBFN method uses wavelet decomposition theory to break down leak signals into three sub-band signals, reflecting the characteristics of the signals at different time scales and frequencies. The wavelet packet energy, maximum values, and time differences calculated through cross-correlation are selected as the input feature vectors for the RBFN. Compared to the time difference of the arrival method, the relative error of wavelet–RBFN is less than 2%.
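As a simplified illustration of the general "spectral features plus classifier" pattern used in these studies (not a reproduction of the VMD–LLE–SVM pipeline of Lu et al. or the SE-CNN of Ning et al.), the sketch below reduces each acoustic segment to Welch band-energy features and trains an SVM on synthetic data.

```python
import numpy as np
from scipy import signal
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def band_energy_features(x, fs, n_bands=16):
    """Split the Welch PSD into n_bands equal bands and return log band energies."""
    _, pxx = signal.welch(x, fs=fs, nperseg=1024)
    bands = np.array_split(pxx, n_bands)
    return np.log10([b.sum() + 1e-12 for b in bands])

fs = 50_000
rng = np.random.default_rng(0)
t = np.arange(fs) / fs
# Synthetic stand-ins: "leak" segments carry an extra high-frequency component.
noise_segs = [rng.normal(0, 1, fs) for _ in range(100)]
leak_segs = [rng.normal(0, 1, fs) + 0.5 * np.sin(2 * np.pi * 20_000 * t) for _ in range(100)]

X = np.array([band_energy_features(x, fs) for x in noise_segs + leak_segs])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print(f"test accuracy on synthetic data: {clf.score(X_te, y_te):.2f}")
```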
The multi-modal and dispersive characteristics of leak signals cause acoustic waves to propagate at different velocities across various frequency ranges, leading to significant errors in leak localization methods based on time delay estimation in practical applications. Guided wave theory is a theoretical framework that studies wave propagation in waveguides and can help identify and analyze the propagation velocities and dispersive characteristics of different acoustic wave modes in pipelines. In the low-frequency range, such as 0 to 10 kHz, leak-induced acoustic signals primarily consist of three modes: bending mode F(1,1), axial mode L(0,1), and torsional mode T(0,1) [16]. The axial mode is a non-dispersive guided wave with a group velocity of approximately 4938 m/s in the low-frequency range (0–2.5 kHz). The torsional mode is also a non-dispersive guided wave with a velocity of approximately 3099 m/s across all frequency ranges [16]. The bending mode is a highly dispersive guided wave, with its group velocity varying from 400 m/s to 1800 m/s in the frequency range of 0–2.5 kHz. In the high-frequency range (10–100 kHz), leak-induced acoustic signals can excite more modes, such as L(0, 2), F(1, 2), and F(1, 3) [8].
Li et al. [16,17] investigated the acoustic detection and localization of natural gas pipeline leaks in the low-frequency range, focusing on addressing the localization errors arising from multimodal guided waves and dispersion effects. Figure 1 shows gas leak detection using acoustic emission technique. Points A and B represent monitoring locations, while ZA and ZB denote the distances from the leak source to monitoring points A and B, respectively. In the literature [16], a weighted window function was constructed in the frequency domain based on the wavenumber of a specific mode. This function was applied to the cross-power spectrum of leak acoustic waves to extract non-dispersive single-mode guided waves. Subsequently, the single-mode cross-correlation function, obtained via the inverse Fourier transform, was utilized for time delay estimation. This method reduced the average localization error from >10% using conventional methods to 1.38% in pipelines exceeding 80 m. Subsequently, in 2019, Li et al. explored a method for adaptively identifying coherent frequency bands containing multiple modes in ref. [17], eliminating the need for prior modal information. The experimental setup consisted of a gas pipeline system and an AE detection system. The gas pipeline system had a total length of 110 m, with AE sensors installed on the pipe walls on both sides of the leak point. Signal analysis proceeded as follows: Welch’s averaged periodogram method was first employed to estimate the auto-spectra and cross-spectra of the signals, and the coherence function between the two signals was computed to determine the coherent frequency band. Subsequently, the characteristic frequency band was selected based on the 3 dB bandwidth around the peak frequency of the coherence function, noting that this characteristic band varied with distance. For instance, it was 6–7 kHz at a distance difference of 2 m and 6.3–7.1 kHz at a distance difference of 59.85 m. A bandpass filter was then applied to extract the leak signals within the identified characteristic frequency band. Following this extraction, the phase spectrum of the cross-spectrum within this band was utilized for time delay estimation, leveraging the observation that the phase spectrum exhibited a linear variation across the characteristic frequency band, with its slope representing the time delay between the two signals. Finally, based on the estimated time delay value, the sound velocity of the leak signal was calculated using Equation (1),
$v = \dfrac{|l_1 - l_2|}{\hat{\tau}_0}$,
where $l_1$ and $l_2$ are the distances from the leak point to the two sensors, respectively, $\tau_0$ is the actual time delay between the arrival of the leak signal at the two sensors, and $\hat{\tau}_0$ is its estimated value.
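A minimal sketch of how Equation (1) is applied follows: the time delay is estimated from the peak of the cross-correlation between the two sensor signals, and the wave speed is then recovered. All signal parameters are synthetic assumptions, not values from refs. [16,17].

```python
import numpy as np

fs = 100_000                        # assumed sampling rate [Hz]
true_delay = 0.016                  # assumed arrival-time difference [s]
n = int(0.1 * fs)
rng = np.random.default_rng(1)
s = rng.normal(0, 1, n)             # stand-in for the broadband leak source signal
x_a = s + 0.1 * rng.normal(0, 1, n)                          # sensor A
x_b = np.roll(s, int(true_delay * fs)) + 0.1 * rng.normal(0, 1, n)  # sensor B (delayed)

# Time-delay estimate from the peak of the cross-correlation.
corr = np.correlate(x_b, x_a, mode="full")
lag = np.argmax(corr) - (n - 1)
tau_hat = lag / fs

l1, l2 = 10.0, 90.0                 # assumed distances from the leak to sensors A and B [m]
v = abs(l1 - l2) / tau_hat          # Equation (1)
print(f"estimated delay: {tau_hat * 1e3:.2f} ms, estimated wave speed: {v:.0f} m/s")
```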
Jiang et al. [8] investigated the alterations in the amplitude and phase of guided wave signals induced by thermo-mechanical coupling variations within high-frequency ranges. The study proposed a method integrating a particle swarm optimization (PSO)-enhanced bidirectional gated recurrent unit (BiGRU) with an attention mechanism for damage identification in nuclear industry pipelines. Researchers employed a sealed spherical electric heating system as the thermal source to simulate high-temperature, high-pressure water within a pressure range of 0.1–15 MPa. The NI data system was utilized to generate excitation signals for ultrasonic guided waves and to collect the corresponding responses. The acquired ultrasonic guided wave signals comprised three primary components: crosstalk signals, direct wave signals, and boundary reflections (including mode-converted waves). To mitigate interference from redundant data, the analysis focused specifically on processing the direct wave signals. Experiments explored the effects of thermo-mechanical coupling variations and structural damage on guided wave signals. The results demonstrated that thermo-mechanical coupling induces significant changes in both the amplitude and phase of guided wave signals, while damage identification based on statistical analysis methods proved ineffective due to insufficient noise resistance. The BiGRU–Attention model exhibited superior feature-learning capabilities, achieving optimal performance in discriminating between different pipeline states. Upon introducing the PSO–BiGRU–Attention model, the classification performance was further enhanced, leveraging its significant advantages in accelerated convergence speed and improved accuracy. This model attained an AUC value of 0.8491, demonstrating exceptional damage identification performance, particularly under varying temperature and pressure conditions.

2.3. Visualization of Acoustic Signature

The methods for converting one-dimensional acoustic time–series signals into two-dimensional images can be divided into time–frequency methods, phase space transformation methods, and state transition methods. Commonly used techniques include the short-time Fourier transform (STFT), continuous wavelet transform (CWT), Stockwell transform (S-transform), Markov transition field (MTF), and recurrence plots (RP) [19,20,59]. STFT, CWT, and S-transform belong to time–frequency representation, which increases data dimensionality by generating a spectrogram and a scalogram. STFT has limited resolution due to the fixed-length short-time windows used to segment the signal, but it offers fast computation. CWT matches the local characteristics of the signal by scaling and translating a mother wavelet function, providing variable resolution, and when combined with STFT, they can complement each other in real-time and at high frequency. The S-transform integrates the phase information from STFT with the multi-resolution properties of wavelet transforms, using frequency-dependent Gaussian window functions, which offer frequency-adaptive resolution but at a higher computational complexity. MTF and RP extend the signal into a two-dimensional space through state probability transitions and phase space transformations, respectively. MTF generates two-dimensional images by spatial expansion, capturing the temporal transition patterns of the signal states, and exhibits good noise resistance and interpretability. However, discretization can lead to loss of detail and assumes that future states depend only on the current state, neglecting long-term dependencies. RP constructs a binary recurrence matrix by measuring the similarity of the signal at different time points, representing the recurrence behavior of the signal in phase space. It can extract global features and retain signal details but is computationally intensive and sensitive to noise.
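The sketch below shows the simplest of these transformations, an STFT spectrogram that turns a 1D acoustic segment into a 2D log-power image suitable as input to an image-based classifier; the chirp is only a stand-in for a leak signal, and all parameters are assumptions.

```python
import numpy as np
from scipy import signal

fs = 50_000
t = np.arange(0, 0.5, 1 / fs)
# Synthetic stand-in for a leak segment: a rising chirp buried in noise.
x = signal.chirp(t, f0=2_000, t1=0.5, f1=15_000) + 0.2 * np.random.randn(t.size)

# STFT spectrogram: rows are frequency bins, columns are time frames.
f, frames, Sxx = signal.spectrogram(x, fs=fs, nperseg=512, noverlap=256)
img = 10 * np.log10(Sxx + 1e-12)          # log-power "image" fed to a CNN or ViT
print(f"spectrogram image shape (freq bins, time frames): {img.shape}")
```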
Song et al. [59] proposed a galvanized steel pipe (GSP) gas leakage detection method based on AE technology and CNN. The experiment collected acoustic signal data under different operating scenarios using AE technology, including pure environmental noise, pure leakage signals, internal flow noise, and mixed signals, with a frequency range of 0–40 kHz. Figure 2 presents the time-frequency distribution generated by wavelet packet decomposition using the db8 mother wavelet. The colors in the figure represent the packet coefficients, with high-energy (high coefficient value) regions depicted in red/yellow and low-energy regions shown in blue/black. Researchers compared two types of signal inputs for the CNN model: one involved applying STFT on the collected acoustic signals to generate spectrograms, which were then input to the CNN model as 2D images, and the other directly used the acoustic spectrum as a 1D signal input. The experimental results revealed that, in mechanical monitoring scenarios, wide kernels in lower layers could effectively suppress noise, differing from computer vision applications that typically employ small-kernel deep networks. Furthermore, the CNN model exhibited high sensitivity to STFT-generated 2D spectrograms, manifesting as test set accuracy lower than the training accuracy (overfitting). The optimal model configuration (3Conv1D-2) using 1D frequency signals as input achieved over 93% accuracy when the leakage rates exceeded 0.03 L/s. Huang et al. [19] introduced a multi-domain encoding-learning algorithm based on acoustic emission, encoding AE signals into images through Gramian angular difference field (GADF) and MTF. GADF captures dynamic characteristics of time–series data while MTF extracts transition features, and their combination enables comprehensive leakage signal feature extraction. Yuan et al. [20] presented a multi-condition pipeline leakage diagnosis method using image fusion and whale optimization algorithm (WOA)-enhanced CNN. The approach first converts 1D leakage time series into 2D acoustic images through STFT, CWT, S-Transform, Markov extended Kalman filter (MEKF), and RP. WOA was employed for CNN parameter optimization. The results demonstrated that the image fusion method outperformed single-image approaches across all conditions, which are 1 mm, 3 mm, and 5 mm holes, and enhanced both recognition accuracy and model robustness.
State-of-the-art acoustic detection systems often employ fusion techniques, such as the fusion of 2D images across different domains [19,20], the integration of acoustic waves with negative pressure waves [21], and the fusion of multi-source acoustic sensors [14]. The accuracy of leakage detection can be further improved by deep-learning algorithms. However, in monitoring high-frequency signals, such as small breaches, acoustic technology imposes high computational demands on processing chips. Additionally, while adhering to the Nyquist sampling theorem, the configuration of sampling frequencies still requires empirical expertise.

3. OGI-Based Leakage Detection System

3.1. Technical Framework and Implementation

3.1.1. Infrared Imaging, Thermal Imaging, and Optical Gas Imaging

Infrared (IR) is a type of electromagnetic wave with wavelengths between visible light and microwaves, spanning 0.8 μm to 100 μm [68]. Infrared imaging is a core application of IR technology. It is a broad term encompassing all imaging techniques within the IR spectral range that generate images by detecting electromagnetic radiation in this wavelength band. Infrared imaging can be divided into two categories: passive imaging, which relies on the thermal radiation naturally emitted by objects, and active imaging, which requires external IR illumination sources. Thermal imaging is a subset of infrared imaging and captures the thermal radiation emitted by all objects above absolute zero temperature without requiring external light sources, generating visual temperature maps through far-infrared band detection. Optical gas imaging (OGI) is a specialized infrared imaging application that detects gas leaks by leveraging specific gas absorption characteristics in mid-to-long wave infrared, requiring sensors or optical filters tuned to target gases’ characteristic absorption wavelengths. This technology is widely used in petrochemical industries and natural gas pipeline leakage monitoring systems.

3.1.2. Principle and Classification of OGI Technologies

OGI is based on the vibrational–rotational energy level transitions of gas molecules. When molecular vibrations induce changes in dipole moment, specific wavelengths of infrared light are absorbed to form characteristic absorption spectra, and OGI achieves gas imaging by detecting these characteristic absorption spectra [68,69]. Consequently, the sensitivity of OGI depends on the magnitude of dipole moment changes in target gases. For instance, SF6, due to its strong polarity, is more readily detectable, primarily absorbing in the long-wave infrared (LWIR) region. Nonpolar gases such as N2, O2, and H2 exhibit minimal dipole moment changes, making them difficult to image via OGI. In fact, most hydrocarbons exhibit absorption peaks in the band of mid-wave infrared (MWIR).
OGI technology is primarily classified, based on its dependence on an additional infrared illumination source, into active and passive systems. However, in industrial applications, a more common classification criterion is detector cooling requirements, distinguishing between cooled and uncooled systems [70]. Cooled systems require cryogenic cooling to suppress thermal noise in the detector, whereas uncooled systems operate at ambient temperature. The former exhibits higher power consumption, broader infrared wavelength coverage (spanning MWIR and LWIR), larger physical size, and higher cost. Uncooled systems are typically designed as portable devices with lower costs, but they exhibit limited sensitivity and are primarily used for detecting short-wave infrared (SWIR) wavelengths. Additionally, OGI systems can be categorized by detector architecture: focal plane arrays (FPAs), which directly capture spatial information through a pixelated sensor array, and single-pixel detectors, which achieve spatial resolution through a single photodetector combined with computational reconstruction techniques (infrared laser) [71]. Emerging extensions of OGI technology include quantitative leak-rate measurement systems, such as quantitative optical gas imaging (QOGI), which enables rapid leak quantification [72].

3.1.3. OGI-Based Leakage Detection Process

The core process of an optical-based gas leakage detection system can be divided into three sequential phases: image acquisition and preprocessing, feature extraction, and leakage source identification or leakage level determination with localization. The system acquires original infrared grayscale images of target scenarios using infrared cameras, typically requiring preprocessing operations such as denoising, filtering, edge extraction, and optical flow extraction to enhance gas imaging by mitigating background interference. Feature extraction may employ gas diffusion models for theoretical analysis or deep learning for multi-scale feature extraction. This subsequently enables the completion of multiple analytical tasks, including: binary classification (leakage/non-leakage) or multi-class classification for leakage level assessment [28,30]; leakage source localization (object detection) [29,31,37]; quantitative analysis of leakage sources and levels (semantic segmentation) [32,33,34,36,38]; and multi-source leakage analysis with level differentiation (instance segmentation) [35]. Notably, object detection, semantic segmentation, and instance segmentation all constitute pixel-level tasks, with instance segmentation presenting the most challenging requirements, as it necessitates the identification and differentiation of each individual object’s category and position within the image, while generating pixel-level masks and categorical labels for all detected objects.

3.1.4. Key Challenges of OGI-Based Gas Leak Detection

Low image resolution, low contrast, and lack of visible texture [28,29,30,34]. The video captured by infrared cameras is of single-channel grayscale images, with no significant color or texture contrast between the gas and the background. The blurred boundaries of the gas increase detection difficulty;
Complex background interference [31]. In complex environments such as chemical plants, non-leak sources like heat sources in the background, light variations, and ambient temperature fluctuations may act as background interference. These factors can affect the identification and localization of leak sources, further complicating detection;
Balancing the model’s light weight, high accuracy, and real-time performance. Detection models are typically deployed on edge devices, requiring a lightweight design while maintaining certain detection accuracy and speed to meet real-time monitoring needs [34,37,39];
Gas as a non-rigid target presents challenges in detection and annotation, along with a lack of large-scale standardized benchmark datasets [28,38,73]. The shape of gas plumes has no fixed pattern and evolves over time, making manual annotation standards difficult to unify. Detecting such plumes is more challenging than detecting rigid objects like vehicles, pedestrians, or animals. Currently, there is no industry-wide image or video benchmark dataset available; this limitation affects model reproducibility and comparability across studies.
OGI-based detection methods are broadly divided into visual and non-visual approaches. Visual methods are further subdivided into background subtraction, optical flow estimation, and deep-learning algorithms (such as CNN and vision transformers). To distinguish this from the multimodal approaches discussed in Section 4, this section focuses solely on the literature related to single-modal detection based on OGI. Table 2 presents the comparison of models and their evaluation metrics on the publicly available OGI detection dataset.

3.2. Background Subtraction

Background subtraction consists of two stages—background modeling and subtraction—where background modeling identifies the static background, while subtraction detects moving targets. Classical background subtraction methods first establish a background model based on the pixel values of the first N frames or consecutive frames in a video, then perform a subtraction operation between the current frame and the background model to obtain a grayscale image of the moving target region, and finally apply a threshold to extract the target. The adjacent-frame difference method, which relies solely on subtraction without background modeling, extracts moving regions by computing pixel differences between two consecutive or spaced frames. Its advantage lies in requiring only subtraction operations, enabling extremely fast computation, but it struggles to detect slow-moving targets. Gaussian mixture model-based background subtraction [77] constructs multiple Gaussian distributions for the background and determines whether a pixel belongs to the background for motion detection. For instance, if a pixel value falls within 1σ of a Gaussian distribution, it is considered part of the background. Otherwise, it is classified as a foreground point (moving target). Multiple Gaussian distributions can adapt to gradual illumination changes and periodic background disturbances, but parameters such as the learning rate and the number of models rely on empirical settings. Other commonly used background modeling methods include K-nearest neighbors for background subtraction [78], CodeBook [79], and ViBe (visual background extractor) [80]. Generally, background subtraction is suitable for scenarios with static or relatively stable dynamic backgrounds, whereas complex dynamic backgrounds, such as moving objects or lighting variations, make it difficult to accurately distinguish foreground from background, thereby compromising detection accuracy.
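A minimal sketch of Gaussian-mixture background subtraction on an infrared video stream, using OpenCV's MOG2 implementation, is given below; the file name, parameter values, and alarm threshold are illustrative assumptions, not settings from the cited works.

```python
import cv2

cap = cv2.VideoCapture("ir_sequence.avi")            # hypothetical infrared video
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16,
                                                detectShadows=False)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    fg_mask = subtractor.apply(gray)                  # foreground = candidate plume pixels
    fg_mask = cv2.medianBlur(fg_mask, 5)              # suppress speckle noise
    if cv2.countNonZero(fg_mask) > 500:               # assumed alarm threshold [pixels]
        print("possible leak-like motion in this frame")
cap.release()
```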
Lu et al. [25] proposed an online SF6 gas leakage detection method based on a Gaussian mixture model. In the preprocessing stage, an improved dynamic inter-frame temporal filter is first employed to suppress Gaussian noise in the images. Next, contrast-limited adaptive histogram equalization is applied to enhance the dark-region details of infrared images, improving local contrast. Finally, an improved Gaussian mixture background model is utilized to adaptively segment SF6 gas leakage regions and mark leakage locations. Experimental results demonstrate that, under test conditions with an indoor leakage rate of 0.06 mL/min and a distance of 5 m, the proposed algorithm effectively overcomes interference from high noise and complex backgrounds in infrared imaging, achieving reliable detection and localization of SF6 leakage. Compared to conventional methods under the same conditions, the proposed approach achieves a higher F1 score. Furthermore, with an infrared image resolution of 320 × 240, the average processing speed exceeds 110 frames per second, meeting real-time processing requirements.

3.3. Optical Flow Estimation

Optical flow refers to the velocity of the brightness pattern motion in an image, and the optical flow field is a pixel-level, two-dimensional, and instantaneous velocity field. If there are no moving objects in the image, the optical flow vectors vary continuously across the entire image region. When moving objects are present, relative motion between the target and background causes different velocity vectors between the moving object and its surrounding background, enabling the detection of moving objects and their locations. Methods for calculating optical flow fields are primarily categorized into gradient-based, matching-based, energy-based, and phase-based approaches [81]. Gradient-based methods, the most extensively studied, compute optical flow using image intensity gradients and rely on the brightness constancy assumption as a prerequisite. Specifically, the intensity of a pixel (x, y) at time t is I(x, y, t). At time t + dt, the pixel moves to (x + dx, y + dy), with an intensity of I(x + dx, y + dy, t + dt).
$I(x, y, t) = I(x + dx, y + dy, t + dt)$,
Performing a first-order Taylor series expansion on Equation (2):
$I(x + dx, y + dy, t + dt) \approx I(x, y, t) + \dfrac{\partial I}{\partial x}dx + \dfrac{\partial I}{\partial y}dy + \dfrac{\partial I}{\partial t}dt$,
Further simplification leads to Equation (4).
$\dfrac{\partial I}{\partial x}\dfrac{dx}{dt} + \dfrac{\partial I}{\partial y}\dfrac{dy}{dt} + \dfrac{\partial I}{\partial t} = I_x u + I_y v + I_t = 0$,
where $u = \frac{dx}{dt}$ and $v = \frac{dy}{dt}$ represent the motion velocities of the reference point along the x and y directions, respectively, i.e., the optical flow. Equation (4) is also referred to as the optical flow constraint equation. Since it contains two unknowns (u, v), it can only determine the motion component of the grayscale value along the gradient direction, while the motion component perpendicular to the gradient remains indeterminate. Mathematically, this is termed an underdetermined system of equations. By introducing additional constraints, the underdetermined problem can be transformed into a well-posed problem. For instance, Horn–Schunck introduces a global smoothness constraint, assuming the minimization of the integral of the squared velocity components within a given neighborhood [82]. Lucas–Kanade, on the other hand, employs a local window constraint, where all pixels within a neighborhood window around a given pixel share the same motion vector (u, v) [82].
Optical flow estimation in fluid structure analysis faces several detection challenges. First, conventional methods based on the brightness constancy assumption need to be adapted to a concentration constancy assumption, incorporating specific constraints such as gradient constancy or refractive index constancy, but the impact of these adjustments on flow velocity estimation has not been sufficiently studied. Second, differences in frame rates, resolutions, and noise levels among different camera types (e.g., visual cameras, OGI cameras) lead to inconsistencies in flow velocity estimation, particularly in high-noise environments caused by cooling system vibrations and radiation variations in OGI cameras, where gas flow velocity estimation becomes significantly more challenging. Additionally, classical algorithms such as Brox [83] and Farneback [84] are highly sensitive to parameter selection, with parameter settings at different frame rates and flow velocities directly affecting estimation accuracy. To address these issues, Shen et al. [26] constructed a comprehensive dataset through wind tunnel experiments, including ZED camera RGB images, OGI camera 16-bit grayscale images, and ultrasonic anemometer reference velocities, and systematically evaluated algorithm performance. In terms of method optimization, parameter tuning experiments on the Brox algorithm identified the optimal parameter combinations for different flow velocities and frame rates, validating their cross-frame-rate applicability. By comparing the Brox, Farneback, and deep-learning-driven FlowNet2 algorithms, it was found that FlowNet2 exhibits stronger robustness across varying frame rates and flow velocities. In the preprocessing stage, ZED color images were converted to grayscale difference images to ensure data consistency, and image quality was assessed using the contrast-to-noise ratio (CNR). A final analysis revealed that, at a low flow velocity of 0.7 m/s, all algorithms performed better on the OGI dataset, providing empirical evidence for optical flow algorithm selection in complex scenarios.
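For reference, the sketch below computes dense Farneback optical flow between two consecutive OGI frames with OpenCV; the file names, parameter values, and motion threshold are assumptions, and, as noted above, such parameters generally need retuning per frame rate and flow velocity.

```python
import cv2
import numpy as np

prev = cv2.imread("ogi_frame_000.png", cv2.IMREAD_GRAYSCALE)   # hypothetical frames
curr = cv2.imread("ogi_frame_001.png", cv2.IMREAD_GRAYSCALE)

# Dense Farneback flow: one (u, v) vector per pixel.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    pyr_scale=0.5, levels=3, winsize=15,
                                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
u, v = flow[..., 0], flow[..., 1]
speed = np.sqrt(u ** 2 + v ** 2)
moving = speed > 0.5                        # assumed motion threshold [px/frame]
print(f"pixels with plume-like motion: {int(moving.sum())}")
```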
The primary advantages of background subtraction and optical flow estimation lie in their low computational cost. While they can be used to locate abnormal diffusion regions in gas leakage detection, they struggle to distinguish between real leaks and artifacts caused by thermal radiation interference. Additionally, continuous background updates are required to prevent model degradation. As a result, more accurate deep-learning algorithms are gradually replacing them. Currently, these two algorithms are often combined with deep-learning methods, serving as prior knowledge that is processed and then input into deep-learning models for further computation.

3.4. Convolutional Neural Network

Convolutional neural networks (CNNs) are defined as deep feedforward neural networks containing convolutional layers, inspired biologically by the concept of receptive fields [27]. As one of the most representative architectures in deep learning, CNN is specifically designed to process grid-structured data, such as images and videos, and has been widely applied in computer vision, natural language processing, recommendation systems, speech recognition, and biomedical fields.
CNN primarily achieves efficient extraction of hierarchical features through local connectivity, parameter sharing, and subsampling, where the parameter-sharing mechanism endows convolutional neural networks with translation equivariance [85]
$f(T(x)) = T(f(x))$,
where $f(x)$ is the feature extraction function and $T(x)$ is the translation function. When the input image undergoes translation, the convolutional output feature map exhibits an identical translational change, meaning that positional variations of objects within the image do not affect the detection results. This property is considered a main reason why CNNs outperform ViTs on small-scale datasets, as it equips CNNs with inherent prior knowledge.
The classic CNN primarily consists of convolutional layers, pooling layers, and fully connected layers [27]. The convolutional layers automatically extract and learn local information from the input data by applying multiple convolutional kernels, with each kernel responsible for detecting a specific feature, such as edges, lines, or textures. The pooling layer, also referred to as the subsampling layer, reduces the dimensionality of feature maps, decreases the number of parameters, mitigates overfitting, and retains critical feature information through operations like max pooling or average pooling. The fully connected layer establishes complete connections between all neurons of the preceding layer and each neuron of the current layer, enabling global information interaction, and maps high-dimensional features to the target output space via a weight matrix. Typically, networks incorporating such fully connected layers are termed fully connected convolutional networks, whereas those without fully connected layers, composed solely of convolutional and pooling layers, are referred to as fully convolutional networks.
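The sketch below assembles these three layer types into a minimal binary leak/no-leak classifier for single-channel infrared frames; the layer sizes and input resolution are illustrative assumptions, not a published architecture.

```python
import torch
import torch.nn as nn

class TinyLeakCNN(nn.Module):
    """Minimal classic CNN: convolution + pooling features, fully connected head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 60 * 80, 64), nn.ReLU(),   # assumes 240 x 320 input frames
            nn.Linear(64, 2),                         # logits for {no-leak, leak}
        )

    def forward(self, x):
        return self.classifier(self.features(x))

frames = torch.randn(4, 1, 240, 320)                  # batch of infrared frames
logits = TinyLeakCNN()(frames)
print(logits.shape)                                   # torch.Size([4, 2])
```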

3.4.1. Fully Connected Convolutional Networks

A fully connected convolutional network typically applies a softmax layer to output class probabilities directly, providing high computational efficiency and making it well suited for binary classification tasks, such as leak versus non-leak detection. However, for object detection tasks requiring precise leak localization, classic CNN architectures must incorporate specialized frameworks like Faster R-CNN [86] or YOLO [87]. These models use fully connected layers for coordinate regression to pinpoint leak sources. Critically, however, fully connected layers flatten spatial data and discard pixel-level positional information, rendering them fundamentally unsuitable for semantic or instance segmentation tasks.
A representative study applying fully connected convolutional neural networks to methane leak detection is presented by Wang et al. [28] using controlled release experiments. The researchers collected 669,600 frames of infrared video data encompassing two leak sources, seven leak levels, and five imaging distances (4.6–15.6 m). Three binary classification schemes were tested, along with three background subtraction methods (fixed average, moving average, and Gaussian mixture model) for image preprocessing, compared against no background subtraction. The results showed that moving average background subtraction achieved 97% accuracy at distances of 4.6 and 6.9 m, significantly improving the detection performance for distant and small leaks. Figure 3 shows frames from videos in the GasVid dataset, captured at distances of 4.6 m (Figure 3a–d) and 6.9 m (Figure 3e–h), under different leak rates. In terms of model architecture, three CNN variants were evaluated, with GasNet-2 (featuring two pooling layers and two fully connected layers) demonstrating the best performance across varying distances and leak sizes, achieving 99% detection accuracy. Meanwhile, Shi et al. [29] applied the Faster R-CNN for real-time object detection of hydrocarbon leak sources. By optimizing the backbone feature extraction network and adjusting the number of region proposals, the model achieved a balance between inference speed and mean average precision (mAP), processing an image in 60 ms with a mAP of 0.71, and the mAP@0.5 reached 0.98. However, its performance in detecting small leak sources was suboptimal, with the mAP (small) at only 0.46.
Although gas flow patterns are not fixed, they exhibit significant continuity in both temporal and spatial dimensions. Leveraging this continuity, researchers have attempted to use consecutive frames as model input. Wang et al. [30] subsequently employed the same GasVid dataset in 2022, adopting a 3D CNN [88] for video classification and comparing the results with previous findings [28] and ConvLSTM. The results demonstrated that the 3D CNN was the most accurate architecture, subsequently named VideoGasNet. This model achieved 100% accuracy in binary classification (leak vs. non-leak detection) and reached a peak accuracy of 78.2% in small–medium–large classification tasks. However, in the more complex eight-class classification task, the accuracy dropped to 39.1%. Bin et al. [31] proposed a foreground fusion framework for LNG leak detection, which generates foreground images via online tensor decomposition. This approach enhances leak regions while mitigating the foreground loss issue prevalent in traditional background subtraction methods. The feature extraction network utilized deformable convolutional networks (DCNs) and feature pyramid networks (FPNs), both enabling multi-scale feature extraction. DCN, by introducing deformable convolutional kernels, adapts to targets of varying scales and shapes during feature extraction, addressing non-rigid object detection challenges. Object detection was performed using a cascaded region-of-interest (ROI) head, progressively refining bounding box localization and classification accuracy through multi-stage cascading. This detection model achieved a mAP of 0.49 on a proprietary infrared video dataset.
When dealing with small gas leaks, the performance of the aforementioned detection methods tends to decline, primarily because detecting minor leaks demands a higher spatial information retention capability from the model. Among these methods, 3D CNN, the ROI head of fast R-CNN, and cascaded ROI heads all incorporate fully connected layers. These layers flatten the input into a one-dimensional vector, losing the translational equivariance inherent in convolutional operations and consequently discarding spatial information. Although multi-scale feature extraction modules such as deformable convolutions, feature pyramids, and cascaded ROI heads can partially enhance the network’s detection performance for small leaks [31], detecting very minor leaks typically requires more advanced pixel-level tasks like semantic segmentation or instance segmentation. In such cases, fully connected layers are often replaced with convolutional layers, resulting in a fully convolutional network.

3.4.2. Fully Convolutional Networks

The fully convolutional network (FCN), first proposed by ref. [89] in 2015, is a deep-learning architecture specifically designed for pixel-level prediction tasks, such as semantic segmentation, and represents a significant milestone in the field of computer vision. To restore the output image dimensions, FCN first employs a fully convolutional structure that allows input images of arbitrary sizes and generates corresponding spatial heatmaps. It then utilizes deconvolution layers to upsample low-resolution feature maps, thereby recovering fine details and enabling end-to-end pixel-level prediction.
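A minimal fully convolutional counterpart is sketched below: the fully connected head is replaced by transposed-convolution upsampling so the network outputs per-pixel class logits at the input resolution. Layer sizes are illustrative assumptions rather than the original FCN of ref. [89].

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal fully convolutional network: no fully connected layers anywhere."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 1/2 resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 1/4 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),           # back to 1/2
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),               # full resolution
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))          # per-pixel class logits

x = torch.randn(1, 1, 240, 320)                       # single infrared frame
print(TinyFCN()(x).shape)                             # torch.Size([1, 2, 240, 320])
```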
Subsequent models, such as U-Net [90], DeepLab [91], and PSPNet [92], are all inspired by it. U-Net achieved feature fusion by concatenating same-dimensional features at different levels, mitigating the issue of edge blurring caused by upsampling information loss in FCN. PSPNet proposed pyramid pooling to enhance the segmentation performance for small targets. The DeepLab series introduced dilated convolutions to address the limited receptive field of FCN, improving segmentation accuracy. Bhatt et al. [32] proposed a U-Net architecture extended to the temporal domain (spatio-temporal U-Net, ST-UNet), which captures and isolates low-level spatiotemporal patterns to generate accurate segmentation masks at the pixel level. Compared to traditional LSTM-based segmentation networks, ST-UNet demonstrated superior pixel-level accuracy, particularly in handling complex backgrounds and low-visibility plumes. Lin et al. [33] introduced a 2.5D-Unet-based infrared gas video segmentation method for gas leakage detection in chemical plants. This network combines spatial and temporal information by stacking 2D spatial convolutions, 1D temporal convolutions, and 3D spatiotemporal convolutions within an encoder–decoder architecture. The hybrid stacked convolutions not only enhanced the network’s representational capacity for leakage appearance and motion but also facilitated model pretraining on static smoke images followed by transfer learning on infrared videos, making this approach suitable for data-scarce scenarios like gas leakage detection. Xu et al. [34] proposed MWIRGas-YOLO, a gas leakage detection method based on MWIR imaging, to address the challenges of detecting and segmenting low-contrast, small-scale gas leaks in complex environments. The network architecture of MWIRGas-YOLO is shown in Figure 4. The study employed a cooled MWIR imaging system with high sensitivity, enabling better detection of subtle temperature and radiation differences. In the model architecture, a global attention mechanism was introduced during feature fusion to reduce background noise interference. The YOLOv8-seg model incorporated a spatial pyramid pooling fused layer and an additional small-target detection layer to further enhance the detection capability for low-concentration, small gas plumes. The experimental results showed that the model achieved mAP50 and mAP50:95 scores of 96.1% and 47.6%, respectively.
Compared to fully connected convolutional networks, FCNs also offer the advantages of fewer parameters and higher GPU parallel computing efficiency, making them frequently employed in real-time detections. Xiaojing et al. [35] proposed a real-time gas plume instance segmentation method based on the lightweight encoder framework ERFNet [93], which simultaneously achieves leakage detection, gas plume segmentation, and multi-leak source differentiation. By assuming that pixels within an instance follow a skewed elliptical distribution—i.e., a 2D Gaussian distribution—in the embedding space, the method addresses the uncertainty in plume morphology. Additionally, the researchers synthesized gas leakage images with complex backgrounds for model training and testing to enhance robustness. Wang et al. [36] introduced an improved infrared gas leakage semantic segmentation algorithm based on the DeeplabV3+ [94] for real-time industrial gas leakage detection. First, the researchers replaced the original backbone network, Xception [95], with MobileNetV2 [96] to reduce model parameters and improve computational speed. Second, they incorporated a DenseASPP [97] module to expand the model’s receptive field through multi-scale feature fusion, thereby enhancing gas segmentation accuracy. Finally, the dice loss function was employed to address class imbalance. The experimental results demonstrated that the improved model performed exceptionally well in gas leakage segmentation tasks, achieving IOU and F1-score improvements of 4.10% and 2.47%, respectively, surpassing traditional semantic segmentation models such as FCN, SegNet [98], and PSPNet [92]. However, the study utilized a relatively small dataset, with only 120 real-world images, augmented to 1148 infrared images via flipping, rotation, and Gaussian blur.

3.5. Vision Transformer

Vision transformer (ViT) [99] directly applies the transformer model from the NLP domain to images and was initially designed for image classification tasks. Processing raw pixels as input would result in excessively long sequences, so ViT divides the input image into fixed-size patches, applies explicit positional encoding to each patch to preserve spatial information, and appends a learnable token to represent the classification result. These elements are fed into a standard transformer encoder in the form of vector sequences and processed via multi-head self-attention (MHSA); the encoder consists of multiple transformer blocks, each operating on the embedded image patches. ViT's attention mechanism requires pairwise computation of attention scores among input elements to form an attention matrix, which is then used to compute weighted averages of the input elements. Consequently, the computational complexity of the attention layer is O(n²), where n is the number of image patches, leading to high computational resource consumption when processing large-scale images.
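The patch-embedding step and the quadratic cost of the attention matrix can be illustrated with the following minimal PyTorch sketch; the image size, patch size, and embedding dimension are assumptions rather than values prescribed by ref. [99].

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each to an embedding."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        self.n = (img_size // patch) ** 2                         # number of patches
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))           # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, dim))  # positional encoding

    def forward(self, x):
        tok = self.proj(x).flatten(2).transpose(1, 2)             # (B, n, dim)
        tok = torch.cat([self.cls.expand(x.size(0), -1, -1), tok], dim=1)
        return tok + self.pos                                     # (B, n+1, dim)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
# Self-attention then forms an (n+1) x (n+1) score matrix per head, hence O(n^2):
scores = tokens @ tokens.transpose(1, 2)   # (1, 197, 197) pairwise attention scores
```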
The original ViT not only suffers from high computational complexity in attention calculation but also struggles to capture multi-granularity features (e.g., edges, textures, objects), primarily because it divides images into fixed-size patches, lacking hierarchical feature representation. Therefore, when applying ViT to gas leakage detection, optimization is required in model lightweighting and multi-scale feature extraction. Current optimization approaches mainly fall into four categories: model architecture design, attention mechanism optimization, model compression, and dynamic computation. Yu et al. [37] proposed a novel lightweight feature extraction network called GasViT for the real-time detection of invisible industrial gas leaks. Building upon ViT, GasViT incorporates an MsFFA module and an MhLSa module. The former enhances the local feature extraction capability in gas thermal imaging through multi-scale fusion, while the latter efficiently handles long-range dependencies via a multi-head linear self-attention mechanism, reducing the computational complexity of attention to O(n). The experimental results show that GasViT achieves 82.7% mAP50 and 59.7% mAP50:95 on a self-constructed IIG dataset, outperforming comparable frameworks, such as MobileViT [101], EfficientNet [102], and ConvNeXt-v2 [103]. Additionally, GasViT achieves 33 FPS real-time detection performance with 47.2 MB memory usage on edge computing devices, making it suitable for gas leakage detection in various industrial scenarios.
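The following sketch illustrates generic kernelized linear attention in the spirit of, but not identical to, GasViT's MhLSa module: applying a feature map to queries and keys allows the key-value product to be computed first, so the n × n attention matrix is never materialized and the cost grows linearly with the number of tokens.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """Kernelized attention phi(Q) (phi(K)^T V), evaluated right-to-left.

    q, k, v: (B, heads, n, d). Using elu(x) + 1 as the kernel feature map
    (as in linear-transformer variants), the n x n attention matrix is never
    formed, so complexity is O(n * d^2) instead of O(n^2 * d).
    """
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                 # (B, h, d, d)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)       # (B, h, n, d)

q = k = v = torch.randn(1, 4, 1024, 32)
out = linear_attention(q, k, v)   # same shape as v, without an n x n matrix
```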
Sarker et al. [38] introduced an architecture called Gasformer for detecting and quantifying low-flow methane emissions from livestock using OGI. This model employs a mix vision transformer (MiT-B0) encoder and a Light-Ham decoder to generate multi-scale features and refine segmentation maps. The MiT-B0 encoder divides input images into overlapping patches and encodes local spatial relationships through 3 × 3 convolutions in a Mix-FFN (feed-forward network), replacing explicit positional encoding to improve local feature capture. During attention computation, MiT-B0 applies a sequence reduction process to decrease the spatial dimensions of the key matrix, reducing self-attention complexity to O(n²/r), where r is the reduction ratio. The Light-Ham decoder further models the global context to produce more accurate methane plume segmentation maps. The experimental results demonstrate that Gasformer outperforms state-of-the-art segmentation models on both the methane release (MR) and cow rumen (CR) datasets, particularly excelling in low-contrast scenarios. On the CR dataset, Gasformer achieves 88.56% mIoU, with a parameter count of 3.652 M and a GPU inference speed of 97.45 FPS, surpassing other comparable models.
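A minimal sketch of this sequence-reduction idea is given below (our simplification, not the MiT-B0 implementation): keys and values are spatially downsampled by a strided convolution before attention, shrinking the key/value sequence, and hence the score matrix, by the reduction factor; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ReducedAttention(nn.Module):
    """Self-attention with sequence reduction: keys/values are spatially
    downsampled before attention, so the score matrix has far fewer columns
    than the full n x n case (SegFormer/MiT-style efficient attention)."""
    def __init__(self, dim, sr=2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr, stride=sr)   # spatial reduction
        self.scale = dim ** -0.5

    def forward(self, x, h, w):
        b, n, d = x.shape                                  # n = h * w patch tokens
        q = self.q(x)                                      # (b, n, d)
        grid = x.transpose(1, 2).reshape(b, d, h, w)
        red = self.sr(grid).flatten(2).transpose(1, 2)     # (b, n / sr^2, d)
        k, v = self.kv(red).chunk(2, dim=-1)
        attn = (q @ k.transpose(1, 2)) * self.scale        # (b, n, n / sr^2)
        return attn.softmax(dim=-1) @ v                    # (b, n, d)

out = ReducedAttention(64, sr=2)(torch.randn(1, 64 * 64, 64), 64, 64)
```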

3.6. Non-Visual Methods

The research methods discussed so far in this section are all based on visual recognition techniques. In fact, infrared and OGI cameras produce single-channel grayscale images, which can be processed through dimensionality reduction and feature compression to enable recognition using machine-learning classification algorithms. Jing et al. [39] proposed a thermal-imaging detection algorithm for leaking gas clouds based on the scale-invariant feature transform (SIFT) and SVM, capable of detecting gas leaks such as methane and ethylene. First, the collected infrared images are processed using the VLFeat software package (https://www.vlfeat.org (accessed on 17 August 2025)) to extract SIFT features from the samples, generating a 128 × N feature matrix for each sample, where each 128-dimensional feature vector is treated as a visual word. Subsequently, all visual words are aggregated, and a vocabulary is constructed using the K-means clustering algorithm. Finally, based on the frequency of visual words in the images, an SVM is employed for binary classification (leak vs. non-leak). During the experiment, a grid search was used to optimize the SVM parameters, achieving a detection accuracy of 92.5% for gas clouds. Compared to classifiers based on histogram features, the SIFT-based classifier demonstrated lower training costs and superior classification performance.
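A hedged sketch of this bag-of-visual-words pipeline is given below, using OpenCV's SIFT and scikit-learn instead of VLFeat; the vocabulary size, grid values, and the frames/labels inputs are illustrative assumptions rather than the settings of ref. [39].

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

sift = cv2.SIFT_create()

def sift_descriptors(gray):
    """128-D SIFT descriptors of one grayscale IR frame, shape (N, 128)."""
    _, desc = sift.detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def bow_histogram(desc, vocab):
    """Word-frequency histogram of one frame against the visual vocabulary."""
    hist = np.zeros(vocab.n_clusters, np.float32)
    if len(desc):
        words, counts = np.unique(vocab.predict(desc), return_counts=True)
        hist[words] = counts / counts.sum()
    return hist

def train_leak_classifier(frames, labels, vocab_size=200):
    """frames: list of grayscale images; labels: 1 = leak, 0 = no leak."""
    all_desc = np.vstack([sift_descriptors(f) for f in frames])
    vocab = KMeans(n_clusters=vocab_size, n_init=10).fit(all_desc)   # visual vocabulary
    X = np.array([bow_histogram(sift_descriptors(f), vocab) for f in frames])
    grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01]}             # grid-searched SVM
    clf = GridSearchCV(SVC(kernel="rbf"), grid, cv=3).fit(X, labels)
    return vocab, clf.best_estimator_
```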

4. Multimodal Fusion Detection System

4.1. Consistency, Complementarity, and Compatibility

Multimodal fusion refers to a technique that integrates information from diverse modalities, such as text, images, audio, video, and sensor data, to enhance a model's perception and reasoning capabilities for complex tasks. These modalities typically exhibit semantic redundancy and complementarity, meaning that, under ideal conditions, multimodal fusion offers greater robustness compared to single-modality approaches. It is important to note that more fused data does not necessarily yield better results. Instead, the fusion must satisfy three key criteria: consistency, complementarity, and compatibility [104]. Consistency requires that data from different modalities mutually support each other at the semantic or logical level without contradiction; complementarity means each modality provides unique information, collectively forming a more comprehensive understanding; and compatibility ensures that data from different modalities can be effectively aligned and interact in representation space or structure.
In gas leak detection scenarios, the commonly fused data types currently include infrared–visible light fusion, infrared–gas sensor fusion, and acoustic–negative pressure wave fusion. OGI leverages the absorption characteristics of gases in the infrared spectrum to visualize otherwise invisible gases, while visible light can compensate for infrared thermal interference and the lack of texture in infrared imaging [40,41,42,43]. Thus, combining the two modalities can effectively reduce false alarms caused by thermal interference. If the absorption peaks of two gases in OGI exhibit low separation or even overlap, such as in the case of CO2 and NO2 in certain spectral bands, auxiliary sensors can be employed for differentiation [44,45,46,47,48,49,50,51,52,53]. Acoustic detection is highly sensitive to minor leaks but suffers from limited propagation distance, whereas negative pressure waves, though less sensitive to minor leaks, are often used for high-flow leak detection due to their long propagation range and ability to provide additional information about leak location and magnitude. The complementary nature of these two modalities enhances the overall performance of the detection system [21].

4.2. Classification of Multimodal Fusion

Multimodal fusion is commonly classified into early fusion, late fusion, and hybrid fusion based on the stage at which fusion occurs [100,104]. Early fusion involves directly merging data or feature vectors from different modalities at the model's input or shallow layers, encompassing input-level and feature-level fusion. Input-level fusion concatenates raw data into a larger tensor without specialized processing, while feature-level fusion extracts feature vectors from different modalities and integrates them during the feature extraction phase to generate a unified cross-modal representation before task-specific decisions. Late fusion, also known as decision-level fusion, combines independent predictions from each modality during the decision-making stage using methods such as voting, weighted averaging, stacking, or other ensemble algorithms. Its advantage lies in the ability to train single-modality models independently and decouple them from the fusion phase, ensuring compatibility with missing modalities; in addition, because the single-modality models are trained independently, their errors are typically uncorrelated [40,105]. Hybrid fusion combines early and late fusion by alternately applying both strategies at different stages. Since input-level fusion lacks inter-modal feature interaction, merely aligning modalities through concatenation, it struggles with complex heterogeneous data, rendering it nearly obsolete in practice. Decision-level fusion also omits feature interaction but may outperform feature-level fusion when inter-modal dependencies are weak or absent. Feature-level fusion effectively addresses heterogeneity across modalities, such as differences in feature dimensions or semantic granularity, but generally requires all modalities to be simultaneously available. Hybrid fusion balances early-stage feature alignment with late-stage decision optimization, accommodating partial modality absence or noise, yet demands careful design of fusion architectures and hyperparameters, potentially increasing training time due to multi-stage integration. Figure 5 is a schematic diagram of feature fusion, decision fusion, and hybrid fusion.
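The feature-level variant can be made concrete with the following minimal sketch, in which per-modality feature vectors (e.g., an acoustic embedding and an infrared-image embedding) are concatenated into one cross-modal representation before a shared classification head; module names and dimensions are illustrative, and a decision-level counterpart is sketched in Section 4.4.

```python
import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Feature-level (early) fusion: per-modality encoders produce feature
    vectors that are concatenated into one cross-modal representation and
    passed to a shared classification head."""
    def __init__(self, d_acoustic=128, d_ir=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_acoustic + d_ir, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, f_acoustic, f_ir):
        # Both modalities must be present and temporally aligned here,
        # unlike decision-level fusion, which can fall back to one modality.
        return self.head(torch.cat([f_acoustic, f_ir], dim=-1))

logits = FeatureLevelFusion()(torch.randn(4, 128), torch.randn(4, 256))
```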

4.3. Feature-Level Fusion

Following the classification method in ref. [106], we categorize feature fusion into encoder–decoder architectures, attention mechanisms, and generative approaches.

4.3.1. Encoder–Decoder Fusion

The encoder–decoder architecture is a universal framework for sequence-to-sequence tasks, with one of the most influential studies being the RNN-based encoder–decoder (Seq2Seq) proposed by Sutskever et al. in 2014 [74]. The Seq2Seq encoder compresses the input sequence into a fixed-dimensional context vector that summarizes its semantic content, while the decoder autoregressively generates the target sequence conditioned on this context vector. However, due to its RNN-based architecture [107], Seq2Seq suffers from information loss on long sequences, a limitation effectively addressed by attention mechanisms [108], leading to their prevalent combined usage in current research.
The encoder–decoder fusion method maps multimodal data into a shared feature space through encoders, then integrates and generates final outputs via decoders. Wang et al. [73] proposed a dual-stream visible–thermal cross attention network (RT-CAN) based on this framework, which enhances thermal-imaging information through RGB image assistance to improve detection accuracy for invisible gases like methane, carbon dioxide, ammonia, and hydrogen sulfide under visible light. RT-CAN employs an encoder–decoder structure where the encoder utilizes ResNet50/152 [109] backbones with RGB-assisted cross attention (RCA) to fuse thermal and RGB features at intermediate ResNet layers, outputting five fused feature maps to the decoder. The decoder processes these through global textual attention (GTA) and aggregation modules to produce dual prediction maps, with post-processing enabling leak/non-leak classification and segmentation. The GTA module implements multiscale feature extraction via multi-kernel convolutional operations and self-channel attention mechanisms, strengthening contextual information capture. RT-CAN outperforms single-stream SOTA models by 4.86% in accuracy, 5.65% in IoU, and 4.88% in F2-score.
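The cross-attention idea underlying RCA can be sketched as follows: co-registered thermal and RGB feature maps are flattened into token sequences, thermal tokens act as queries over RGB keys/values, and the attended RGB context is added back to the thermal stream. This is a simplified illustration built on PyTorch's standard multi-head attention, not the authors' implementation; channel counts and head numbers are assumptions.

```python
import torch
import torch.nn as nn

class RGBAssistedCrossAttention(nn.Module):
    """Cross attention between co-registered thermal and RGB feature maps:
    thermal tokens query RGB tokens, and the attended RGB context is added
    back to the thermal stream (a sketch, not the exact RCA module)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, thermal, rgb):
        b, c, h, w = thermal.shape
        t = thermal.flatten(2).transpose(1, 2)   # (b, h*w, c) thermal tokens (queries)
        r = rgb.flatten(2).transpose(1, 2)       # (b, h*w, c) RGB tokens (keys/values)
        fused, _ = self.attn(self.norm(t), r, r)
        fused = t + fused                        # residual fusion into the thermal stream
        return fused.transpose(1, 2).reshape(b, c, h, w)

thermal = torch.randn(1, 64, 32, 32)
rgb = torch.randn(1, 64, 32, 32)
out = RGBAssistedCrossAttention(64)(thermal, rgb)   # fused thermal feature map
```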

4.3.2. Attention Mechanism Fusion

The attention mechanism dynamically focuses on key data components through automatic weight adjustment, originally proposed in natural language processing and subsequently introduced to computer vision. Local attention mechanisms concentrate on regional details while global attention mechanisms require comprehension of holistic contexts and scenarios. In gas leak detection applications, local attention addresses minor leak source localization, whereas global attention identifies complex background information, with implementations primarily through CNNs and ViTs.
Traditional convolutional neural networks capture spatial features through local receptive fields and weight sharing, where the fixed-weight computation of convolutional kernels can be regarded as a static, content-agnostic approximation of local attention, lacking the ability to dynamically adjust regions of interest. Therefore, one approach involves making convolutional kernels "dynamic" within local regions. For instance, Deformable ConvNets [110] achieve deformation-aware local perception through offset prediction. Dynamic convolution [111] dynamically generates convolutional kernel parameters for each input sample, enabling convolution weights to adapt to content changes for dynamic focus akin to attention. The CBAM [112] enhances responses in important regions through a gating mechanism while maintaining local computation scope. Global attention mechanisms, on the other hand, require expanding the receptive field: dilated convolution [113] enlarges the effective kernel through skip sampling (inserting gaps between kernel elements) to indirectly cover larger areas without adding parameters. SENet [114] acquires global channel-wise information via global average pooling and then generates channel weights through fully connected layers to emphasize important channels. Non-local networks [115] implement a global attention mechanism by computing association weights between each position and all other positions on the feature map. The integration of local and global attention is achieved through pyramid structures [116], where shallow layers use local attention to capture fine details, while deeper layers employ downsampled feature maps for lightweight global computation.
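As a concrete example of the channel-attention mechanism described above, a squeeze-and-excitation block [114] can be sketched as follows; the channel count and reduction ratio are illustrative.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: global average pooling
    gathers image-wide context, and two FC layers produce per-channel weights
    that rescale the feature map (a lightweight form of global attention)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))     # squeeze: (b, c) global descriptor
        return x * w.view(b, c, 1, 1)       # excite: reweight channels

out = SEBlock(64)(torch.randn(1, 64, 32, 32))
```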
Yan et al. [21] developed a multimodal fusion CNN classification model that integrates data from acoustic, negative-pressure wave, and flow sensors to improve the detection efficiency of minor leaks in natural gas pipelines while reducing false positives and false negatives in fault detection. To leverage pretrained convolutional image models, researchers converted the collected leakage signals into Gramian angular difference field (GADF) images for network input. This high-dimensional spatial transformation representation enhances temporal feature extraction, thereby revealing more hidden information [117]. The model incorporates a dual information fusion (DIF) module and a channel-split multi-scale convolution (CSMC) module to combine attention mechanisms with multiscale feature fusion, enhancing global perception and multiscale feature representation while effectively reducing parameter counts and computational complexity. Kang et al. [41] proposed a dual-stream fusion detection framework to process visible and infrared features of VOCs. Since targets exist only in infrared images while visible images serve as supplementary information, the researchers designed modal attention (MA) and cross-modal interaction (IAM) modules for foreground feature selection and background interference elimination. Visible and infrared feature fusion is achieved using a spatial attention fusion module (SAFM), with a modal adapter increasing the weight of infrared images to better extract potential VOC features. Li et al. [42] introduced a method based on a convolutional block multi-attention fusion network (CBMAFNet) to fuse hyperspectral and thermal-imaging data for detecting underground natural gas microleaks. CBMAFNet, through channel attention mechanisms, spatial attention mechanisms, and 3D convolutional neural networks [13], achieved an average accuracy of 94.60%, outperforming single-sensor data models and conventional CNN models. Liang et al. [118] proposed a pipeline process model based on a CNN-BiLSTM-AM framework, using SCADA (supervisory control and data acquisition) data such as flow, pressure, and temperature for pipeline monitoring. In this model, CNN and BiLSTM are used to extract spatial and temporal features, respectively, and the concatenated features are fed into an attention module (AM) to enhance sensitivity to critical features. Additionally, the researchers employed a K-means clustering-based boundary determination method to improve detection capability for minor leakage events.
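The GADF transform used to turn 1-D leak signals into image-like inputs can be sketched as follows; this is a generic NumPy implementation, and the normalization range and signal length are assumptions rather than details taken from ref. [21].

```python
import numpy as np

def gadf(signal):
    """Gramian angular difference field of a 1-D signal.

    The signal is rescaled to [-1, 1], mapped to polar angles phi = arccos(x),
    and the image entry (i, j) is sin(phi_i - phi_j), which encodes temporal
    correlations as a 2-D texture suitable for pretrained image CNNs.
    """
    x = np.asarray(signal, dtype=float)
    x = 2 * (x - x.min()) / (x.max() - x.min() + 1e-12) - 1   # rescale to [-1, 1]
    sin_phi = np.sqrt(np.clip(1 - x ** 2, 0, 1))
    # sin(phi_i - phi_j) = sin(phi_i)cos(phi_j) - cos(phi_i)sin(phi_j)
    return np.outer(sin_phi, x) - np.outer(x, sin_phi)

img = gadf(np.sin(np.linspace(0, 8 * np.pi, 224)))   # (224, 224) GADF image
```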
The ViT divides an image into a sequence of patches, achieving global interaction through a self-attention mechanism, where each patch establishes connections with all other patches, thereby enhancing the model's ability to comprehend overall structures. Bin et al. [40] proposed a vision Fourier transformer framework (VFTED) that integrates visible and infrared light for ethane detection. During the fusion stage of visible and infrared light, an FPN [116] is employed to extract multi-scale features, and the FFT replaces the multi-head attention mechanism (MHA). The underlying principle treats the image frequency F(u, v) as global attention, determining the distribution density of planar waves (serving as tokens) within the image. The improved attention mechanism reduces computational complexity to O(n log(n)). To enable neural networks to process complex numbers derived from the Fourier transform, the paper also designs a Fourier multilayer perceptron (FMLP).
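The idea of replacing self-attention with a Fourier transform for global token mixing at O(n log n) can be sketched with the following FNet-style block; the FMLP of ref. [40] handles the complex-valued spectrum differently, so this sketch is only illustrative, and keeping the real part of the FFT is an assumption of this simplification.

```python
import torch
import torch.nn as nn

class FourierMixer(nn.Module):
    """Global token mixing via FFT instead of self-attention (FNet-style).

    A 2-D FFT over the token and channel dimensions mixes information
    globally at O(n log n); keeping the real part returns a real-valued
    tensor that the following feed-forward layer can process."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens):                          # tokens: (B, n, dim)
        mixed = torch.fft.fft2(tokens, dim=(-2, -1)).real
        tokens = self.norm(tokens + mixed)              # residual + norm
        return tokens + self.ffn(tokens)

out = FourierMixer(64)(torch.randn(1, 196, 64))
```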

4.3.3. Generative Fusion

Generative fusion aims to supplement and enhance existing datasets by capturing and replicating the underlying data distribution, which plays a crucial role when certain modalities are difficult to obtain or missing [106]. For instance, VOCs are challenging to collect in infrared imaging due to their flammability, explosiveness, and toxicity, whereas visible-light smoke images are more accessible. Generative models can combine these modalities to produce fused images, thereby leveraging models pre-trained on large-scale datasets in visible light. Wang Y. et al. (2023) [43] employed CycleGAN to transform smoke images under natural light into infrared images, generating the ComplexGasVid dataset. This dataset comprises approximately 4000 frames with complex backgrounds and various interferences to improve model detection performance in challenging environments. Architecturally, researchers enhanced the traditional faster R-CNN by incorporating optical flow convolution to extract motion information from optical flow images. By integrating texture and motion information, the flow faster R-CNN network demonstrated superior gas leak detection accuracy compared to faster R-CNN, achieving mAP scores of 44.3% and 61.9%, respectively, representing a 15.6% improvement in average precision with the addition of optical flow information. Wang Q. et al. (2024) [119] proposed an intelligent gas leak detection method named YOLOGAS based on YOLOv5 [44] to enhance the automation and accuracy of industrial gas leak detection. Researchers first converted 18,546 pairs of infrared–visible images into infrared images using the Pix2pix model [45]. By integrating the Swin transformer, attention mechanisms (CBAM and ECA), and an improved pyramid structure (BiFPN), the model's perception and feature fusion capabilities for gas targets were strengthened. Experimental results showed that YOLOGAS achieved AP50 and mAP50:95 scores of 88.07% and 53.70%, respectively, outperforming other mainstream detection models while maintaining fast recognition speed.
Synthesis based on simulation software or gas-plume mechanism models requires superimposing the generated gas images as foreground onto background images to enhance the model's robustness against complex background interference. For instance, the simulation software Blender was used to automatically generate gray smoke plume instances as the foreground, which were then superimposed, with random transparency, onto infrared images from multiple open-source datasets [35], such as KAIST [46], to synthesize the data.
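A minimal sketch of such foreground-background compositing with random transparency is given below; the array sizes, transparency range, and mask construction are illustrative assumptions rather than the settings of ref. [35].

```python
import numpy as np

def composite_plume(background, plume, plume_mask, rng=np.random.default_rng()):
    """Alpha-blend a synthetic plume onto an infrared background frame.

    background: (H, W) float image in [0, 1]
    plume:      (H, W) rendered plume intensity in [0, 1]
    plume_mask: (H, W) soft mask in [0, 1] marking plume pixels
    A random global transparency imitates varying gas concentration, which
    helps the trained model tolerate complex-background interference."""
    alpha = rng.uniform(0.2, 0.8) * plume_mask            # random transparency
    return np.clip((1 - alpha) * background + alpha * plume, 0.0, 1.0)

bg = np.random.rand(288, 384)          # placeholder infrared background frame
plume = np.random.rand(288, 384)       # placeholder rendered plume intensity
mask = (plume > 0.7).astype(float)     # placeholder soft plume mask
synthetic_frame = composite_plume(bg, plume, mask)
```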

4.4. Decision-Level Fusion

A representative study on decision-level fusion was conducted by Narkhede et al., who combined MOS sensors with infrared imaging [47,48]. MOS sensors enable the detection of low-concentration and mixed gases, while infrared imaging compensates with long-range identification capability. The researchers employed an LSTM and a CNN to predict detection results from the MOS sensors and IR imaging, respectively. Both early fusion (direct concatenation of MOS sensor readings with IR images) and decision-level fusion (averaging and maximizing prediction probabilities) were evaluated. The results demonstrated that both fusion strategies achieved 96% accuracy, surpassing single-modality performances (IR: 93%; MOS: 83%). While decision-level fusion enables independent output generation when either the IR or MOS modality is absent, quantitative experimental validation under modality-absence scenarios was not conducted by the researchers. Subsequently, Narkhede et al. released the open-source dataset MultimodalGasData [48], comprising 6400 samples across four categories, namely smoke, perfume, a mixture of smoke and perfume, and neutral environments, with 1600 samples per category. This dataset aims to assist researchers and system developers in training and developing AI models and systems.
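The averaging and maximizing fusion rules, together with the single-modality fallback that decision-level fusion permits, can be sketched as follows; the class ordering and probability values are illustrative, not results from refs. [47,48].

```python
import numpy as np

def fuse_decisions(p_ir=None, p_mos=None, rule="average"):
    """Decision-level fusion of per-modality class probabilities.

    p_ir, p_mos: probability vectors from the IR-image model and the
    MOS-sensor model (either may be None if that modality is unavailable).
    """
    available = [p for p in (p_ir, p_mos) if p is not None]
    if not available:
        raise ValueError("at least one modality is required")
    if len(available) == 1:                  # graceful fallback to a single modality
        return available[0]
    stacked = np.vstack(available)
    fused = stacked.mean(axis=0) if rule == "average" else stacked.max(axis=0)
    return fused / fused.sum()               # renormalize to a probability vector

p_ir = np.array([0.10, 0.90])                # e.g., [no-leak, leak] from the IR model
p_mos = np.array([0.30, 0.70])               # from the MOS-sensor model
print(fuse_decisions(p_ir, p_mos, rule="average"))
```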
Based on the open-source dataset MultimodalGasData, Zhang et al. [49] employed AlexNet [120], ResNet-50 [109], and MobileNet [121] for deep feature extraction and replaced the decision algorithm with the ensemble algorithm deep forest, improving the single-modal accuracy to 73.6% and the fused accuracy to 98.5%. El Barkani et al. [50] utilized MobileNetV1 [121] and EfficientNet-B0 [122] to achieve lightweight edge deployment. Sharma et al. [51] explored the application of federated learning for privacy protection. Rahate et al. [52] compared intermediate fusion and multi-task fusion methods, with the experimental results demonstrating that multi-task fusion performs better when handling missing and noisy modal data. Attallah [53] proposed a deep-learning-based pipeline that employed discrete wavelet transform (DWT) and discrete cosine transform (DCT) to improve intermediate fusion and multi-task fusion, respectively, and combined the two approaches.

5. Discussion

Here, we supplement several important points that are not formally addressed in the main body of this paper:
(1) In engineering practice, "acoustic signal visualization" is often broadly termed "acoustic imaging" [20,123]. However, strictly defined acoustic-imaging techniques require spatial resolution capabilities for physical location mapping [124]. To ensure scientific terminology usage, this study adheres to precise technical definitions and clearly distinguishes between the two concepts: acoustic imaging specifically refers to sound-field reconstruction technologies with spatial resolution capabilities, while acoustic signal visualization encompasses broader signal transformation and representation methods;
(2) This paper did not differentiate between real-time monitoring and shutdown maintenance. Taking acoustic detection technologies as an example:
Acoustic emission (AE), as a passive monitoring technique, relies on elastic waves released by structural damage itself. It is suitable for real-time condition monitoring during equipment operation, but its localization accuracy is constrained by wave-velocity calculation errors and sensor placement [7,14,18,19,22,23,66];
Ultrasonic guided waves (UGW) actively excite low-frequency ultrasonic waves, offering sub-millimeter detection sensitivity for internal defects such as pipeline cracks or corrosion. However, because the excitation energy must be controlled and interference avoided, UGW inspection is typically performed during equipment shutdown maintenance [8,16,17].
(3) Monitoring key thermodynamic and state parameters (e.g., temperature, flow rate, pressure/negative pressure waves) is effective for identifying large-diameter leaks, though localization is typically limited to a segmental range [125]. However, these methods, especially conventional threshold-based approaches, are generally inadequate or unreliable for detecting micro- and small-diameter leaks because the signal changes are subtle relative to system noise. The detection of micro-leaks often requires more advanced, sensitive, and costly techniques [126];
(4) As ref. [106] mentioned, graph neural networks (GNNs) can serve as effective tools for feature fusion. However, no relevant research has been reported in the field of pipeline gas leak detection thus far. The existing literature has only applied GNNs to the topological modeling of complex pipeline networks in negative-pressure wave detection scenarios [127]. This technical gap may stem from the inherent modeling constraints of GNNs: their core relies on "node-edge" structures to define entity relationships, whereas multimodal gas leak detection requires the fusion of heterogeneous data sources, which conventional graph structures struggle to represent directly for cross-modal feature interactions. Therefore, moving beyond node-centric modeling paradigms and exploring novel graph representation frameworks for non-structured data patterns will be a key research direction for advancing GNN applications in multimodal leak detection.

6. Conclusions

This paper has reviewed gas leak detection methods based on acoustics, OGI, and multimodal fusion. Acoustic detection technology features rapid response and broad coverage, making it suitable for detecting micro-leaks in high-pressure pipelines. However, it is prone to environmental noise interference, particularly in complex industrial settings. Current improvements primarily focus on enhancing time–frequency feature extraction capabilities through deep learning.
OGI technology visualizes gas leaks via infrared imaging, offering high sensitivity for preliminary large-area, long-distance monitoring. Nevertheless, it suffers from blurred gas boundaries and vulnerability to thermal radiation interference, complicating the differentiation between actual leaks and heat-induced artifacts. Most current studies integrate attention mechanisms into infrared recognition models or fuse infrared with visible light, both demonstrating promising results.
Multimodal fusion combines data from acoustic, optical, and other sensors to enhance detection robustness and accuracy via deep-learning algorithms. However, significant gaps persist, including (1) the need for lightweight, real-time models suitable for edge deployment; (2) a lack of scalable sensor deployment frameworks; (3) a shortage of adaptive-learning techniques for dynamic environments; and (4) unresolved challenges in spatiotemporal synchronization and heterogeneous data fusion. Moving forward, critical research priorities could focus on:
Advanced fusion architectures: Transformer-based networks, self-supervised learning paradigms, and digital twin-integrated systems;
Foundation development: Large-scale benchmark datasets, optimized feature extraction, and multi-sensor diagnostics;
Deployment optimization: Hardware-aware model lightweighting and cross-modal calibration frameworks.

Author Contributions

Writing—original draft preparation, Y.G.; writing—review and editing, X.W., H.H., and X.S.; Supervision, Z.H., C.B., and Y.J.; funding acquisition, C.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by NPIC’s Sustainable Support Project for the State Key Laboratory of Advanced Nuclear Energy Technology, grant number STSW-0224-0404-05.

Conflicts of Interest


Abbreviations

The following abbreviations are used in this manuscript:
AI	Artificial Intelligence
AE	Acoustic Emission
AM	Attention Module
BiGRU	Bidirectional Gated Recurrent Unit
BiLSTM	Bidirectional Long Short-Term Memory
BiFPN	Bidirectional Feature Pyramid Network
CBAM	Convolutional Block Attention Module
CNN	Convolutional Neural Network
CNR	Contrast-to-Noise Ratio
ConvLSTM	Convolutional Long Short-Term Memory
CR	Cow Rumen
CWT	Continuous Wavelet Transform
DCN	Deformable Convolutional Network
DCT	Discrete Cosine Transform
DenseASPP	Dense Atrous Spatial Pyramid Pooling
DWT	Discrete Wavelet Transform
ECA	Efficient Channel Attention
FCN	Fully Convolutional Network
FCNN	Fully Convolutional Neural Network
FPA	Focal Plane Array
FPN	Feature Pyramid Network
FPS	Frames Per Second
FRFT	Fractional Fourier Transform
GADF	Gramian Angular Difference Field
GAN	Generative Adversarial Network
GNN	Graph Neural Network
GSP	Galvanized Steel Pipe
IIG	Industrial Invisible Gas
IGS	Infrared Gas Segmentation
IoU	Intersection Over Union
IR	Infrared
LBB	Leak Before Break
LLE	Locally Linear Embedding
LNG	Liquefied Natural Gas
LWIR	Long-Wave Infrared
mAP	Mean Average Precision
MTF	Markov Transition Field
MEKF	Markov Extended Kalman Filter
MOS	Metal Oxide Semiconductor
MR	Methane Release
MhLSa	Multi-head Linear Self-Attention
MsFFA	Multi-scale Fusion Feature Attention
MWIR	Mid-Wave Infrared
NI	National Instruments
NLP	Natural Language Processing
NPP	Nuclear Power Plant
OGI	Optical Gas Imaging
ppm	Parts Per Million
PSO	Particle Swarm Optimization
QOGI	Quantitative Optical Gas Imaging
RBFN	Radial Basis Function Network
ROI	Region of Interest
RP	Recurrence Plot
SE	Spectral Enhancement
SENet	Squeeze-and-Excitation Network
SIFT	Scale-Invariant Feature Transform
SNR	Signal-to-Noise Ratio
SOTA	State of the Art
STFT	Short-Time Fourier Transform
S-transform	Stockwell transform
SVM	Support Vector Machines
SWIR	Short-Wave Infrared
UAV	Unmanned Aerial Vehicle
UGW	Ultrasonic Guided Waves
VGGNet	Visual Geometry Group Network
VMD	Variational Mode Decomposition
ViBe	Visual Background Extractor
ViT	Vision Transformer
VOC	Volatile Organic Compound
WOA	Whale Optimization Algorithm
WT	Wavelet Transform
YOLO	You Only Look Once

References

  1. Ding, Q.; Yang, Z.G.; Zheng, H.L.; Lou, X. Failure analysis on abnormal leakage of TP321 stainless steel pipe in hydrogen-eliminated system of nuclear power plant. Eng. Fail. Anal. 2018, 89, 286–292. [Google Scholar] [CrossRef]
  2. Meribout, M.; Khezzar, L.; Azzi, A.; Ghendour, N. Leak detection systems in oil and gas fields: Present trends and future prospects. Flow Meas. Instrum. 2020, 75, 101772. [Google Scholar] [CrossRef]
  3. Zou, S. Panjin Officials Punished After Chemical Plant Explosion. China Daily. 2024. Available online: https://www.chinadaily.com.cn/a/202410/28/WS671f593ca310f1265a1ca0aa.html (accessed on 17 August 2025).
  4. Ray, B. Malaysia Gas Pipeline Fire: An In-Depth Look at the Puchong Blaze. Breslin Media. 2025. Available online: https://breslin.media/safety-lessons-puchong-gas-fire (accessed on 17 August 2025).
  5. Li, C.; Yang, M. Applicability of Leak-Before-Break (LBB) Technology for the Primary Coolant Piping System to CPR1000 Nuclear Power Plants in China. In Proceedings of the 20th International Conference on Structural Mechanics in Reactor Technology (SMiRT 20), Espoo, Finland, 9–14 August 2009. [Google Scholar]
  6. Fiates, J.; Vianna, S.S. Numerical modelling of gas dispersion using OpenFOAM. Process Saf. Environ. Prot. 2016, 104, 277–293. [Google Scholar] [CrossRef]
  7. Zhang, Z.; Xu, C.; Xie, J.; Zhang, Y.; Liu, P.; Liu, Z. MFCC-LSTM framework for leak detection and leak size identification in gas-liquid two-phase flow pipelines based on acoustic emission. Measurement 2023, 219, 113238. [Google Scholar] [CrossRef]
  8. Jiang, Q.; Qu, W.; Xiao, L. Pipeline damage identification in nuclear industry using a particle swarm optimization-enhanced machine learning approach. Eng. Appl. Artif. Intell. 2024, 133, 108467. [Google Scholar] [CrossRef]
  9. Cai, B.P.; Yang, C.; Liu, Y.H.; Kong, X.D.; Gao, C.T.; Tang, A.B.; Liu, Z.K.; Ji, R.J. A data-driven early micro-leakage detection and localization approach of hydraulic systems. J. Cent. South Univ. 2021, 28, 1390–1401. [Google Scholar] [CrossRef]
  10. Lu, J.; Fu, Y.; Yue, J.; Zhu, L.; Wang, D.; Hu, Z. Natural gas pipeline leak diagnosis based on improved variational modal decomposition and locally linear embedding feature extraction method. Process Saf. Environ. Prot. 2022, 164, 857–867. [Google Scholar] [CrossRef]
  11. Xiao, Q.; Li, J.; Sun, J.; Feng, H.; Jin, S. Natural-gas pipeline leak location using variational mode decomposition analysis and cross-time–frequency spectrum. Measurement 2018, 124, 163–172. [Google Scholar] [CrossRef]
  12. Yao, L.; Zhang, Y.; He, T.; Luo, H. Natural gas pipeline leak detection based on acoustic signal analysis and feature reconstruction. Appl. Energy 2023, 352, 121975. [Google Scholar] [CrossRef]
  13. Yao, L.; Zhang, Y.; Wang, L.; Li, R.; He, T. Natural Gas Pipeline Leak Detection Based on Dual Feature Drift in Acoustic Signals. IEEE Trans. Ind. Inform. 2024, 21, 1950–1959. [Google Scholar] [CrossRef]
  14. Han, X.; Zhao, S.; Cui, X.; Yan, Y. Localization of CO2 gas leakages through acoustic emission multi-sensor fusion based on wavelet-RBFN modeling. Meas. Sci. Technol. 2019, 30, 085007. [Google Scholar] [CrossRef]
  15. Xiao, R.; Hu, Q.; Li, J. Leak detection of gas pipelines using acoustic signals based on wavelet transform and Support Vector Machine. Measurement 2019, 146, 479–489. [Google Scholar] [CrossRef]
  16. Li, S.; Zhang, J.; Yan, D.; Wang, P.; Huang, Q.; Zhao, X.; Cheng, Y.; Zhou, Q.; Xiang, N.; Dong, T. Leak detection and location in gas pipelines by extraction of cross spectrum of single non-dispersive guided wave modes. J. Loss Prev. Process Ind. 2016, 44, 255–262. [Google Scholar] [CrossRef]
  17. Li, S.; Han, M.; Cheng, Z.; Xia, C. Multi-modal identification of leakage-induced acoustic vibration in gas-filled pipelines by selection of coherent frequency band. Int. J. Press. Vessel. Pip. 2020, 188, 104193. [Google Scholar] [CrossRef]
  18. Zhang, Z.; Huang, J.; Yu, Y.; Qin, R.; Wang, J.; Zhang, S.; Su, Y.; Wen, G.; Cheng, W.; Chen, X. Microleakage acoustic emission monitoring of pipeline weld cracks under complex noise interference: A feasible framework. J. Sound Vib. 2025, 604, 118980. [Google Scholar] [CrossRef]
  19. Huang, Q.; Shi, X.; Hu, W.; Luo, Y. Acoustic emission-based leakage detection for gas safety valves: Leveraging a multi-domain encoding learning algorithm. Measurement 2025, 242, 116011. [Google Scholar] [CrossRef]
  20. Yuan, Y.; Cui, X.; Han, X.; Gao, Y.; Lu, F.; Liu, X. Multi-condition pipeline leak diagnosis based on acoustic image fusion and whale-optimized evolutionary convolutional neural network. Eng. Appl. Artif. Intell. 2025, 153, 110886. [Google Scholar] [CrossRef]
  21. Yan, W.; Liu, W.; Zhang, Q.; Bi, H.; Jiang, C.; Liu, H.; Wang, T.; Dong, T.; Ye, X. Multisource multimodal feature fusion for small leak detection in gas pipelines. IEEE Sens. J. 2023, 24, 1857–1865. [Google Scholar] [CrossRef]
  22. Saleem, F.; Ahmad, Z.; Siddique, M.F.; Umar, M.; Kim, J.M. Acoustic Emission-Based pipeline leak detection and size identification using a customized One-Dimensional densenet. Sensors 2025, 25, 1112. [Google Scholar] [CrossRef]
  23. Quy, T.B.; Kim, J.M. Leak detection in a gas pipeline using spectral portrait of acoustic emission signals. Measurement 2020, 152, 107403. [Google Scholar] [CrossRef]
  24. Zuo, J.; Zhang, Y.; Xu, H.; Zhu, X.; Zhao, Z.; Wei, X.; Wang, X. Pipeline leak detection technology based on distributed optical fiber acoustic sensing system. IEEE Access 2020, 8, 30789–30796. [Google Scholar] [CrossRef]
  25. Lu, Q.; Li, Q.; Hu, L.; Huang, L. An effective Low-Contrast SF6 gas leakage detection method for infrared imaging. IEEE Trans. Instrum. Meas. 2021, 70, 5009009. [Google Scholar] [CrossRef]
  26. Shen, Z.; Schmoll, R.; Kroll, A. Measurement of fluid flow velocity by using infrared and visual cameras: Comparison and evaluation of optical flow estimation algorithms. In Proceedings of the 2023 IEEE Sensors, Vienna, Austria, 29 October–1 November 2023; IEEE: New York, NY, USA, 2023; pp. 1–4. [Google Scholar]
  27. Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017, 29, 2352–2449. [Google Scholar] [CrossRef]
  28. Wang, J.; Tchapmi, L.P.; Ravikumar, A.P.; McGuire, M.; Bell, C.S.; Zimmerle, D.; Savarese, S.; Brandt, A.R. Machine vision for natural gas methane emissions detection using an infrared camera. Appl. Energy 2020, 257, 113998. [Google Scholar] [CrossRef]
  29. Shi, J.; Chang, Y.; Xu, C.; Khan, F.; Chen, G.; Li, C. Real-time leak detection using an infrared camera and Faster R-CNN technique. Comput. Chem. Eng. 2020, 135, 106780. [Google Scholar] [CrossRef]
  30. Wang, J.; Ji, J.; Ravikumar, A.P.; Savarese, S.; Brandt, A.R. VideoGasNet: Deep learning for natural gas methane leak classification using an infrared camera. Energy 2022, 238, 121516. [Google Scholar] [CrossRef]
  31. Bin, J.; Bahrami, Z.; Rahman, C.A.; Du, S.; Rogers, S.; Liu, Z. Foreground fusion-based liquefied natural gas leak detection framework from surveillance thermal imaging. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 7, 1151–1162. [Google Scholar] [CrossRef]
  32. Bhatt, R.; Gokhan Uzunbas, M.; Hoang, T.; Whiting, O.C. Segmentation of low-level temporal plume patterns from IR video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  33. Lin, H.; Gu, X.; Hu, J.; Gu, X. Gas leakage segmentation in industrial plants. In Proceedings of the 2020 Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020; IEEE: New York, NY, USA, 2020; pp. 1639–1644. [Google Scholar]
  34. Xu, S.; Wang, X.; Sun, Q.; Dong, K. MWIRGas-YOLO: Gas Leakage Detection Based on Mid-Wave Infrared Imaging. Sensors 2024, 24, 4345. [Google Scholar] [CrossRef]
  35. Xiaojing, G.U.; Haoqi, L.; Dewu, D.; Xingsheng, G.U. An infrared gas imaging and instance segmentation based gas leakage detection method. J. East China Univ. Sci. Technol. 2023, 49, 76–86. [Google Scholar]
  36. Wang, Q.; Xing, M.; Sun, Y.; Pan, X.; Jing, Y. Optical gas imaging for leak detection based on improved deeplabv3+ model. Opt. Lasers Eng. 2024, 175, 108058. [Google Scholar] [CrossRef]
  37. Yu, H.; Wang, J.; Wang, Z.; Yang, J.; Huang, K.; Lu, G.; Deng, F.; Zhou, Y. A lightweight network based on local–global feature fusion for real-time industrial invisible gas detection with infrared thermography. Appl. Soft Comput. 2024, 152, 111138. [Google Scholar] [CrossRef]
  38. Sarker, T.T.; Embaby, M.G.; Ahmed, K.R.; AbuGhazaleh, A. Gasformer: A transformer-based architecture for segmenting methane emissions from livestock in optical gas imaging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 5489–5497. [Google Scholar]
  39. Jing, W.; Pan, Y.; Minghe, W.; Li, L.; Weiqi, J.; Wei, C.; Bingcai, S. Thermal imaging detection method of leak gas clouds based on support vector machine. Acta Opt. Sin. 2022, 42, 0911002. [Google Scholar]
  40. Bin, J.; Rogers, S.; Liu, Z. Vision Fourier transformer empowered multi-modal imaging system for ethane leakage detection. Inf. Fusion 2024, 106, 102266. [Google Scholar] [CrossRef]
  41. Kang, Y.; Shi, K.; Tan, J.; Cao, Y.; Zhao, L.; Xu, Z. Multimodal Fusion Induced Attention Network for Industrial VOCs Detection. IEEE Trans. Artif. Intell. 2024, 5, 6385–6398. [Google Scholar] [CrossRef]
  42. Li, K.; Xiong, K.; Jiang, J.; Wang, X. A convolutional block multi-attentive fusion network for underground natural gas micro-leakage detection of hyperspectral and thermal data. Energy 2025, 319, 134870. [Google Scholar] [CrossRef]
  43. Wang, Y.; Huang, L.; Cheng, Z.; Xu, J.; Li, Q. Flow Faster RCNN: Deep Learning Approach for Infrared Gas Leak Detection in Complex Chemical Plant Surroundings. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; IEEE: New York, NY, USA, 2023; pp. 7823–7830. [Google Scholar]
  44. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. Ultralytics/yolov5: v3.0; Zenodo: Geneva, Switzerland, 2020. [Google Scholar]
  45. Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  46. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  47. Narkhede, P.; Walambe, R.; Mandaokar, S.; Chandel, P.; Kotecha, K.; Ghinea, G. Gas detection and identification using multimodal artificial intelligence based sensor fusion. Appl. Syst. Innov. 2021, 4, 3. [Google Scholar] [CrossRef]
  48. Narkhede, P.; Walambe, R.; Chandel, P.; Mandaokar, S.; Kotecha, K. MultimodalGasData: Multimodal dataset for gas detection and classification. Data 2022, 7, 112. [Google Scholar] [CrossRef]
  49. Zhang, E.; Zhang, E. Development of A Multimodal Deep Feature Fusion with Ensemble Learning Architecture for Real-Time Gas Leak Detection. In Proceedings of the 2024 IEEE 3rd International Conference on Computing and Machine Intelligence (ICMI), Mt Pleasant, MI, USA, 13–14 April 2024; IEEE: New York, NY, USA, 2024; pp. 1–6. [Google Scholar]
  50. El Barkani, M.; Benamar, N.; Talei, H.; Bagaa, M. Gas Leakage Detection Using Tiny Machine Learning. Electronics 2024, 13, 4768. [Google Scholar] [CrossRef]
  51. Sharma, A.; Khullar, V.; Kansal, I.; Chhabra, G.; Arora, P.; Popli, R.; Kumar, R. Gas Detection and Classification Using Multimodal Data Based on Federated Learning. Sensors 2024, 24, 5904. [Google Scholar] [CrossRef]
  52. Rahate, A.; Mandaokar, S.; Chandel, P.; Walambe, R.; Ramanna, S.; Kotecha, K. Employing multimodal co-learning to evaluate the robustness of sensor fusion for industry 5.0 tasks. Soft Comput. 2023, 27, 4139–4155. [Google Scholar] [CrossRef]
  53. Attallah, O. Multitask deep learning-based pipeline for gas leakage detection via E-nose and thermal imaging multimodal fusion. Chemosensors 2023, 11, 364. [Google Scholar] [CrossRef]
  54. Blackstock, D.T. Fundamentals of Physical Acoustics; John Wiley & Sons: Hoboken, NJ, USA, 2000; pp. 2–4. [Google Scholar]
  55. Rathod, V.T. A review of acoustic impedance matching techniques for piezoelectric sensors and transducers. Sensors 2020, 20, 4051. [Google Scholar] [CrossRef]
  56. Yang, D.; Hou, N.; Lu, J.; Ji, D. Novel leakage detection by ensemble 1DCNN-VAPSO-SVM in oil and gas pipeline systems. Appl. Soft Comput. 2022, 115, 108212. [Google Scholar] [CrossRef]
  57. Shimanskiy, S.; Iijima, T.; Naoi, Y. Development of acoustic leak detection and localization methods for inlet piping of Fugen nuclear power plant. J. Nucl. Sci. Technol. 2004, 41, 183–195. [Google Scholar] [CrossRef]
  58. Tariq, S.; Bakhtawar, B.; Zayed, T. Data-driven application of MEMS-based accelerometers for leak detection in water distribution networks. Sci. Total Environ. 2022, 809, 151110. [Google Scholar] [CrossRef]
  59. Song, Y.; Li, S. Gas leak detection in galvanised steel pipe with internal flow noise using convolutional neural network. Process Saf. Environ. Prot. 2021, 146, 736–744. [Google Scholar] [CrossRef]
  60. Zhang, J.; Lian, Z.; Zhou, Z.; Xiong, M.; Lian, M.; Zheng, J. Acoustic method of high-pressure natural gas pipelines leakage detection: Numerical and applications. Int. J. Press. Vessel. Pip. 2021, 194, 104540. [Google Scholar] [CrossRef]
  61. da Cruz, R.P.; da Silva, F.V.; Fileti, A.M.F. Machine learning and acoustic method applied to leak detection and location in low-pressure gas pipelines. Clean Technol. Environ. Policy 2020, 22, 627–638. [Google Scholar] [CrossRef]
  62. Ning, F.; Cheng, Z.; Meng, D.; Wei, J. A framework combining acoustic features extraction method and random forest algorithm for gas pipeline leak detection and classification. Appl. Acoust. 2021, 182, 108255. [Google Scholar] [CrossRef]
  63. Zaman, D.; Tiwari, M.K.; Gupta, A.K.; Sen, D. A review of leakage detection strategies for pressurised pipeline in steady-state. Eng. Fail. Anal. 2020, 109, 104264. [Google Scholar] [CrossRef]
  64. Ning, F.; Cheng, Z.; Meng, D.; Duan, S.; Wei, J. Enhanced spectrum convolutional neural architecture: An intelligent leak detection method for gas pipeline. Process Saf. Environ. Prot. 2021, 146, 726–735. [Google Scholar] [CrossRef]
  65. Siddique, M.F.; Ahmad, Z.; Ullah, N.; Kim, J. A hybrid deep learning approach: Integrating short-time fourier transform and continuous wavelet transform for improved pipeline leak detection. Sensors 2023, 23, 8079. [Google Scholar] [CrossRef]
  66. Saleem, F.; Ahmad, Z.; Kim, J.M. Real-Time Pipeline Leak Detection: A Hybrid Deep Learning Approach Using Acoustic Emission Signals. Appl. Sci. 2024, 15, 185. [Google Scholar] [CrossRef]
  67. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  68. Wehling, R.L. Infrared spectroscopy. In Food Analysis; Springer: Boston, MA, USA, 2010; pp. 407–420. [Google Scholar]
  69. Hodgkinson, J.; Tatam, R.P. Optical gas sensing: A review. Meas. Sci. Technol. 2012, 24, 012004. [Google Scholar] [CrossRef]
  70. Olbrycht, R.; Kałuża, M. Optical gas imaging with uncooled thermal imaging camera-impact of warm filters and elevated background temperature. IEEE Trans. Ind. Electron. 2019, 67, 9824–9832. [Google Scholar] [CrossRef]
  71. Gibson, G.M.; Sun, B.; Edgar, M.P.; Phillips, D.B.; Hempler, N.; Maker, G.T.; Malcolm, G.P.; Padgett, M.J. Real-time imaging of methane gas leaks using a single-pixel camera. Opt. Express 2017, 25, 2998–3005. [Google Scholar] [CrossRef]
  72. Ilonze, C.; Wang, J.; Ravikumar, A.P.; Zimmerle, D. Methane quantification performance of the quantitative optical gas imaging (QOGI) system using single-blind controlled release assessment. Sensors 2024, 24, 4044. [Google Scholar] [CrossRef]
  73. Wang, J.; Lin, Y.; Zhao, Q.; Luo, D.; Chen, S.; Chen, W.; Peng, X. Invisible gas detection: An RGB-thermal cross attention network and a new benchmark. Comput. Vis. Image Underst. 2024, 248, 104099. [Google Scholar] [CrossRef]
  74. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2014; Volume 27. [Google Scholar]
  75. Guo, W.; Du, Y.; Du, S. LangGas: Introducing Language in Selective Zero-Shot Background Subtraction for Semi-Transparent Gas Leak Detection with a New Dataset. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 4490–4500. [Google Scholar]
  76. Yu, H.; Wang, J.; Yang, J.; Huang, K.; Zhou, Y.; Deng, F.; Lu, G.; He, S. GasSeg: A lightweight real-time infrared gas segmentation network for edge devices. Pattern Recognit. 2025, 170, 111931. [Google Scholar] [CrossRef]
  77. Zivkovic, Z. Improved adaptive Gaussian mixture model for background subtraction. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, 26 August 2004; IEEE: New York, NY, USA, 2004; Volume 2, pp. 28–31. [Google Scholar]
  78. Qasim, S.; Khan, K.N.; Yu, M.; Khan, M.S. Performance evaluation of background subtraction techniques for video frames. In Proceedings of the 2021 International Conference on Artificial Intelligence (ICAI), Islamabad, Pakistan, 5–7 April 2021; IEEE: New York, NY, USA, 2021; pp. 102–107. [Google Scholar]
  79. DeCuir-Gunby, J.T.; Marshall, P.L.; McCulloch, A.W. Developing and using a codebook for the analysis of interview data: An example from a professional development research project. Field Methods 2011, 23, 136–155. [Google Scholar] [CrossRef]
  80. Wang, Z.; Shen, X.; Sun, J.; Qiu, B.; Yu, Q. Improved Visual Background Extractor Based on Motion Saliency. In Proceedings of the 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 6–8 November 2020; IEEE: New York, NY, USA, 2020; Volume 1, pp. 209–212. [Google Scholar]
  81. Fleet, D.; Weiss, Y. Optical flow estimation. In Handbook of Mathematical Models in Computer Vision; Springer: Boston, MA, USA, 2006; pp. 237–257. [Google Scholar]
  82. Bruhn, A.; Weickert, J.; Schnörr, C. Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. Int. J. Comput. Vis. 2005, 61, 211–231. [Google Scholar] [CrossRef]
  83. Brox, T.; Bruhn, A.; Papenberg, N.; Weickert, J. High accuracy optical flow estimation based on a theory for warping. In Proceedings of the Computer Vision—ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, 11–14 May 2004; Proceedings, Part IV 8; Springer: Berlin/Heidelberg, Germany, 2004; pp. 25–36. [Google Scholar]
  84. Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Proceedings of the Image Analysis: 13th Scandinavian Conference, SCIA 2003, Halmstad, Sweden, 29 June–2 July 2003; Proceedings 13; Springer: Berlin/Heidelberg, Germany, 2003; pp. 363–370. [Google Scholar]
  85. Goodfellow, I.; Bengio, Y.; Courville, A.; Bengio, Y. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Volume 1. [Google Scholar]
  86. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2015; Volume 28. [Google Scholar]
  87. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  88. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  89. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  90. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  91. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  92. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  93. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272. [Google Scholar] [CrossRef]
  94. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  95. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  96. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  97. Yang, M.; Yu, K.; Zhang, C.; Li, Z.; Yang, K. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3684–3692. [Google Scholar]
  98. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  99. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  100. Siddiqui, M.F.H.; Javaid, A.Y. A multimodal facial emotion recognition framework through the fusion of speech with visible and infrared images. Multimodal Technol. Interact. 2020, 4, 46. [Google Scholar] [CrossRef]
  101. Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  102. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning (PMLR), Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  103. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling convnets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 16133–16142. [Google Scholar]
  104. Tang, Q.; Liang, J.; Zhu, F. A comparative review on multi-modal sensors fusion based on deep learning. Signal Process. 2023, 213, 109165. [Google Scholar] [CrossRef]
  105. Cui, C.; Yang, H.; Wang, Y.; Zhao, S.; Asad, Z.; Coburn, L.A.; Wilson, K.T.; Landman, B.A.; Huo, Y. Deep multimodal fusion of image and non-image data in disease diagnosis and prognosis: A review. Prog. Biomed. Eng. 2023, 5, 022001. [Google Scholar] [CrossRef]
  106. Zhao, F.; Zhang, C.; Geng, B. Deep multimodal data fusion. ACM Comput. Surv. 2024, 56, 216. [Google Scholar] [CrossRef]
  107. Schmidt, R.M. Recurrent neural networks (rnns): A gentle introduction and overview. arXiv 2019, arXiv:1912.05911. [Google Scholar] [CrossRef]
  108. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; Volume 30. [Google Scholar]
  109. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  110. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  111. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
  112. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  113. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  114. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  115. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  116. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  117. Wang, Z.; Oates, T. Imaging time-series to improve classification and imputation. arXiv 2015, arXiv:1506.00327. [Google Scholar] [CrossRef]
  118. Liang, J.; Liang, S.; Ma, L.; Zhang, H.; Dai, J.; Zhou, H. Leak detection for natural gas gathering pipeline using spatio-temporal fusion of practical operation data. Eng. Appl. Artif. Intell. 2024, 133, 108360. [Google Scholar] [CrossRef]
  119. Wang, Q.; Sun, Y.; Jing, Y.; Pan, X.; Xing, M. YOLOGAS: An Intelligent Gas Leakage Source Detection Method Based on Optical Gas Imaging. IEEE Sens. J. 2024, 24, 35621–35627. [Google Scholar] [CrossRef]
  120. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  121. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  122. Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning (PMLR), Virtual, 18–24 July 2021; pp. 10096–10106. [Google Scholar]
  123. Ahmad, S.; Ahmad, Z.; Kim, C.H.; Kim, J.M. A method for pipeline leak detection based on acoustic imaging and deep learning. Sensors 2022, 22, 1562. [Google Scholar] [CrossRef]
  124. Norton, S.J. Theory of Acoustic Imaging; Stanford University: Stanford, CA, USA, 1977; pp. 1–2. [Google Scholar]
  125. Ling, K.; Han, G.; Ni, X.; Xu, C.; He, J.; Pei, P.; Ge, J. A new method for leak detection in gas pipelines. Oil Gas Facil. 2015, 4, 097–106. [Google Scholar] [CrossRef]
  126. Ukil, A.; Libo, W.; Ai, G. Leak detection in natural gas distribution pipeline using distributed temperature sensing. In Proceedings of the IECON 2016—42nd Annual Conference of the IEEE Industrial Electronics Society, Florence, Italy, 23–26 October 2016; IEEE: New York, NY, USA, 2016; pp. 417–422. [Google Scholar]
  127. Zhang, X.; Shi, J.; Huang, X.; Xiao, F.; Yang, M.; Huang, J.; Yin, X.; Usmani, A.S.; Chen, G. Towards deep probabilistic graph neural network for natural gas leak detection and localization without labeled anomaly data. Expert Syst. Appl. 2023, 231, 120542. [Google Scholar] [CrossRef]
Figure 1. Schematic of leak detection and location in gas pipelines [16].
Figure 2. Wavelet package transformation of AE signals in different cases [59].
Figure 3. Frame captures from the GasVid dataset at different distances and leak rates [28,30].
Figure 4. MWIRGas-YOLO network architecture [34].
Figure 5. Schematic diagram of feature, decision, and hybrid fusion [100].
Table 1. A summary of acoustic signal processing methods.

| Method | Characteristic | Temporal/Frequency Resolution | Complexity | Real-Time Capability | Leak Size/Leak Rate |
|---|---|---|---|---|---|
| STFT [20,59,64,65] | Linear time–frequency analysis | Low/Low | Low | Yes, 1.04 s/sample [64] | <1 mm |
| WT [14,15,53,66] | Multiscale analysis | Dynamic time–frequency resolution; scale reciprocity | Medium | No; dependent on hardware acceleration, with implementation potential [14,15] | <2 mm |
| FRFT [18] | Rotated time–frequency plane | Medium/Medium | Medium | Yes, but requires GPU acceleration | 1.006–1.288 L/min |
| VMD [10,11] | Constrained optimization | High/High | High | No; dependent on FPGA acceleration for feature extraction | Leak or not |
| MTF [19] | Time-series-to-probability matrix | – | Low | Yes, 17.97 ms/sample | Leak or not |
| S-Transform [20] | Adaptive-window STFT | Dynamic time–frequency resolution; frequency reciprocity | High | No; requires the complete signal segment to generate time–frequency representations | 1 mm |
| PR [20] | Phase-space visualization | – | High | No; processing time grows exponentially with signal length | 1 mm |
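To make the trade-offs in Table 1 concrete, the snippet below is a minimal, illustrative sketch (not drawn from any of the cited implementations) of STFT-based band-energy thresholding on a synthetic leak-like acoustic signal. The sampling rate, the 30–80 kHz "leak band", the signal amplitudes, and the threshold rule are all assumptions chosen purely for demonstration.

```python
# Illustrative sketch only: STFT band-energy detection of a synthetic leak-like
# broadband noise burst. All parameters (fs, band, threshold) are assumed values.
import numpy as np
from scipy.signal import stft

fs = 200_000                                   # assumed sampling rate, Hz
t = np.arange(0, 1.0, 1 / fs)                  # 1 s record

rng = np.random.default_rng(0)
background = 0.05 * rng.standard_normal(t.size)   # ambient noise
leak = 0.2 * rng.standard_normal(t.size)          # broadband leak noise
leak[: t.size // 2] = 0.0                          # leak starts at t = 0.5 s
x = background + leak

# Short-time Fourier transform: fixed window -> uniform time/frequency resolution
f, tau, Zxx = stft(x, fs=fs, nperseg=1024, noverlap=512)

# Energy per frame in an assumed leak-sensitive band (30-80 kHz)
band = (f >= 30e3) & (f <= 80e3)
band_energy = np.sum(np.abs(Zxx[band, :]) ** 2, axis=0)

# Simple adaptive threshold learned from the quiet first 0.4 s
baseline = band_energy[tau < 0.4]
threshold = baseline.mean() + 5.0 * baseline.std()
onsets = tau[band_energy > threshold]
if onsets.size:
    print(f"Leak flagged at t ≈ {onsets[0]:.3f} s")
else:
    print("No leak flagged")
```

The same framework would swap the STFT for a wavelet or S-transform representation when finer low-frequency resolution is needed, at the cost of the higher computational load noted in Table 1.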
Table 2. A comparison of models and their evaluation metrics on publicly available OGI detection datasets.

| Dataset | Type | Description | Ref. | Task | Accuracy | mAP50 | IoU | Model (Edge Computing) | Device (Edge Computing) |
|---|---|---|---|---|---|---|---|---|---|
| GasVid | IR videos | 669,600 frames of eight size classes | [28] | Binary classification | 91–99% | – | – | – | – |
| GasVid | IR videos | 669,600 frames of eight size classes | [30] | 8-class classification | 39.1% | – | – | – | – |
| Gas-DB | IR/RGB images | Over 1000 RGB-T images | [74] | Semantic segmentation | – | – | 56.52% | – | – |
| IIG | IR images | 11,186 images with bounding boxes | [37] | Object detection | – | 82.7% | – | 7.7 M with 81.75 FPS | Snapdragon 865/IMX6Quad (Qualcomm, San Diego, CA, USA) |
| SimGas | IR videos | Synthetically generated, 12,000 frames with pixel-level segmentation masks | [75] | Semantic segmentation | – | – | 69% | 0.5 FPS | NVIDIA RTX 3090 (NVIDIA, Santa Clara, CA, USA) |
| IGS | IR images | 6426 images | [76] | Semantic segmentation | – | – | 90.68% (mIoU) | 6.79 M with 62.1 FPS | IMX6Quad (NXP, Eindhoven, The Netherlands) |
| CR/MR | IR images | MR: ~6800 images; CR: ~340 images | [38] | Semantic segmentation | – | – | 88.56% (mIoU) | 3.65 M with 97.45 FPS | CPU: Intel Core i9 11900F (2.50 GHz/32 GB; Intel, Santa Clara, CA, USA); GPU: NVIDIA RTX 3090 |

IIG will be made available on request.
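For the segmentation entries in Table 2, IoU and mIoU are the headline metrics. The sketch below shows one common way to compute them for binary gas-plume masks; it is an illustrative example with toy random masks, and the cited works may instead average IoU over classes (gas and background) rather than over images.

```python
# Illustrative sketch: IoU / mIoU for binary gas-plume segmentation masks.
# Note: some papers define mIoU as a per-class average; this version averages per image.
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union for two boolean masks of identical shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:                      # both masks empty: define IoU as 1
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds, targets) -> float:
    """mIoU as the average per-image IoU over a set of mask pairs."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))

# Toy usage with random masks standing in for gas-plume predictions
rng = np.random.default_rng(0)
preds = [rng.random((64, 64)) > 0.5 for _ in range(4)]
targets = [rng.random((64, 64)) > 0.5 for _ in range(4)]
print(f"mIoU = {mean_iou(preds, targets):.4f}")
```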