Article

SF6 Leak Detection in Infrared Video via Multichannel Fusion and Spatiotemporal Features

College of Electrical Engineering, Henan University of Technology, Zhengzhou 450001, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(20), 11141; https://doi.org/10.3390/app152011141
Submission received: 19 August 2025 / Revised: 9 October 2025 / Accepted: 15 October 2025 / Published: 17 October 2025

Abstract

With the development of infrared imaging technology and the integration of intelligent algorithms, non-contact, dynamic, and real-time detection of SF6 gas leakage based on infrared video has become a significant research direction. However, existing real-time detection algorithms exhibit low accuracy in detecting SF6 leakage and are susceptible to noise, making it difficult to meet practical engineering requirements. To address this problem, this paper proposes a real-time SF6 leakage detection method, VGEC-Net, based on multi-channel fusion and spatiotemporal feature extraction. The proposed method first employs the ViBe-GMM algorithm to extract foreground masks, which are then fused with infrared images to construct a dual-channel input. In the backbone network, a CE-Net structure—integrating CBAM and ECA-Net—is combined with the P3D network to achieve efficient spatiotemporal feature extraction. A Feature Pyramid Network (FPN) and a temporal Transformer module are further integrated to enhance multi-scale feature representation and temporal modeling, thereby significantly improving detection performance for small-scale targets. Experimental results demonstrate that VGEC-Net achieves a mean average precision (mAP) of 61.7% on the dataset used in this study, with an mAP@50 of 87.3%, representing a significant improvement over existing methods. These results validate the effectiveness of the proposed method for infrared video-based gas leakage detection. Furthermore, the model achieves 78.2 frames per second (FPS) during inference, demonstrating good real-time processing capability while maintaining high detection accuracy and exhibiting strong application potential.

1. Introduction

Sulfur hexafluoride (SF6), a high-performance insulating medium with excellent electrical insulation, thermal stability, and arc-quenching properties, is widely used in power system equipment such as high-voltage circuit breakers, gas-insulated switchgear (GIS), and cable terminations. However, SF6 leakage can significantly reduce the dielectric strength of equipment, thereby compromising safe and stable operation and posing serious safety risks to on-site personnel [1,2]. Moreover, SF6 is a potent greenhouse gas, with a 100-year global warming potential (GWP) approximately 25,200 times that of carbon dioxide [3]. Alongside environmental concerns, regulatory frameworks also stress the need for timely SF6 leakage detection. For instance, the UK government requires regular leak inspections of high-voltage switchgear containing SF6 and mandates immediate repair when specific thresholds are exceeded [4].
Because SF6 is colorless and odorless, it is difficult to detect directly by human perception, and achieving remote, automated, and real-time leakage detection remains a considerable challenge [5]. Conventional detection methods primarily rely on handheld gas detectors, which, although low-cost and easy to deploy, require substantial manual effort and fail to satisfy the requirements of automation and real-time monitoring [6]. In recent years, techniques such as photoacoustic spectroscopy [7,8], differential absorption lidar [9,10], and acoustic signal detection [11,12] have been proposed; however, these are often limited to quantitative analysis and face difficulties in accurately localizing leakage sources.
Owing to the characteristic appearance of SF6 gas as dark gray smoke in the infrared spectrum, computer vision techniques provide a feasible solution for automated leakage detection. However, infrared video in practical applications still suffers from low resolution, insufficient contrast, and complex backgrounds, resulting in indistinct target features and high susceptibility to noise. Existing algorithms demonstrate limited accuracy in complex environments and exhibit a high miss-detection rate for small-scale leakage targets.
Based on the aforementioned challenges, this paper proposes a multi-channel temporal feature fusion model for infrared SF6 leakage detection, termed VGEC-Net. The core innovations of the model are as follows:
(1)
Spatiotemporal feature joint modeling: Utilizing the P3D-CE backbone network to jointly extract temporal and spatial features from infrared videos, thereby improving the perception of dynamic and small-scale leakage targets.
(2)
Multi-scale semantic fusion: Employing a Feature Pyramid Network (FPN) to fuse semantic information across multiple feature scales, thereby improving small-target detection performance in complex backgrounds.
(3)
Dynamic variation perception enhancement: Incorporating a temporal Transformer module to strengthen the model’s capability to capture dynamic changes and moving targets associated with leakage events.
(4)
Real-time and high-precision detection: Achieving significant improvements over existing methods across diverse complex scenarios while maintaining computational efficiency, with advantages in both detection accuracy and false-negative suppression.
In summary, VGEC-Net not only offers an effective solution for automated, remote, and real-time SF6 leakage monitoring but also provides a valuable technical reference for small-target detection in infrared videos and industrial gas leakage surveillance.

2. Related Works

2.1. Infrared Video Preprocessing

Infrared images captured by cameras often suffer from low resolution, insufficient contrast, and noise interference [13], which obscure target features and degrade detection performance. To improve the quality of infrared images, filtering-based denoising methods [14] are commonly employed to suppress background noise while preserving structural details, and contrast enhancement techniques such as automatic gain control (AGC) [15] are applied to increase the grayscale difference between the target and the background, thereby providing a more reliable data basis for subsequent detection.

2.2. Object Detection Methods

Since R-CNN [16] introduced convolutional neural networks into object detection, the field has undergone rapid development. Single-stage detectors, such as the YOLO series [17,18,19], SSD [20,21], and EfficientDet [22,23], as well as two-stage detection algorithms, such as Faster R-CNN [24,25] and Mask R-CNN [26], have achieved excellent performance across diverse application scenarios. Furthermore, video object detection methods enhance detection capabilities in dynamic scenes by modeling temporal dependencies. For instance, 3DVSD [27] captures spatiotemporal features to identify smoke, while MEGA [28] employs a self-attention mechanism to integrate global and local information, thereby strengthening keyframe representations. However, these methods have not been specifically optimized for SF6 leakage detection, and many still face challenges in meeting real-time detection requirements.

2.3. Attention Mechanisms in Object Detection

Attention mechanisms simulate the human visual focusing process by assigning different weights to features, thereby guiding the model to concentrate on key information regions and enhancing feature representations. Channel attention mechanisms (e.g., SE-Net [29], SK-Net [30], ECA-Net [31]) emphasize informative channel features while suppressing redundant ones. Spatial attention mechanisms (e.g., GC-Net [32], SA-Net [33]) generate position-specific weight maps to strengthen target region representations. Hybrid attention mechanisms (e.g., CBAM [34,35], DANet [36]) combine channel and spatial attention while maintaining model efficiency, thereby enhancing its ability to perceive salient targets. These mechanisms provide valuable insights for improving the accuracy of infrared SF6 leakage detection.

2.4. Temporal Modeling in Detection Frameworks

In video-based detection tasks, effective temporal modeling is crucial for capturing motion cues and subtle dynamic patterns. Traditional approaches extend CNN backbones with 3D convolutions or temporal pooling, but these methods typically capture only short-term dependencies and either introduce substantial computational overhead or lose fine-grained temporal details. Recurrent neural networks (RNNs), such as LSTMs, have also been applied to sequential data, yet they suffer from vanishing gradients, limited parallelism, and degraded performance when handling long sequences.
Transformer-based modules have recently gained significant attention due to their ability to integrate long-range dependencies through self-attention mechanisms. Park et al. [37] applied transformer-based models to real operational time-series data for fault diagnosis of air handling units, achieving F1-scores exceeding 95% across diverse operating modes. In the visual domain, Wang et al. [38] demonstrated that integrating a Swin Transformer into a YOLOv10 detector substantially improved performance on non-PPE detection tasks, achieving an average AP50 of approximately 87.3%. Similarly, the Temporal Fusion Transformer (TFT) has been widely recognized for its robustness and interpretability in multi-horizon forecasting tasks. These studies underscore the effectiveness of transformers in temporal modeling and provide strong motivation for their integration into modern detection frameworks.

3. Research Methodology

The proposed VGEC-Net is constructed upon a pseudo-three-dimensional residual network. The network utilizes an improved Gaussian Mixture Model (GMM) to extract foreground masks, which are fused with the corresponding infrared frames. To ensure compatibility with convolutional backbones, the single-channel infrared images are replicated across three channels (forming a pseudo-RGB representation) and concatenated with the foreground mask, producing a four-channel input (R, G, B, Mask) for enhanced feature representation. During the feature extraction stage, the backbone network adopts a pseudo-3D residual structure (P3D-ResNet) integrated with CE-Net, which combines CBAM and dual-path ECA modules to strengthen spatial and channel modeling. Subsequently, a Feature Pyramid Network (FPN) and a temporal Transformer are incorporated to enhance multi-scale feature fusion and temporal dependency modeling. Finally, precise detection of SF6 gas leakage regions is accomplished using the FCOS detection head. The overall network architecture is illustrated in Figure 1.

3.1. Image Preprocessing

Contrast Limited Adaptive Histogram Equalization (CLAHE) and bilateral filtering are applied to enhance the contrast and signal-to-noise ratio of the original infrared images. The Gaussian Mixture Model (GMM) constructs a background distribution for each pixel while incorporating the ViBe update strategy. Through sample replacement and spatial diffusion mechanisms, this approach enables rapid adaptation to dynamic scenes, ultimately generating foreground pixel maps with motion characteristics. The detailed process is illustrated in Figure 2.

3.1.1. Image Enhancement

CLAHE (Contrast Limited Adaptive Histogram Equalization) is a widely used image enhancement technique aimed at improving local contrast, especially in regions that are excessively dark or bright due to uneven illumination. Unlike traditional global histogram equalization methods, CLAHE performs equalization independently on local regions of the image, effectively preventing over-enhancement. The implementation process is as follows (a minimal code sketch is given after the steps):
(1)
Divide the input image into N × N equally sized subregions.
(2)
Compute the histogram for each subregion. Assume that the pixel intensity values in a local region fall within the range $[0, G-1]$, where $G$ denotes the number of gray-scale levels. The local histogram $H_k$ represents the frequency of intensity value $k$ and is calculated as follows:
H_k = \sum_{i=1}^{N} \delta(I_i, k)
where $N$ is the number of pixels in the region, $I_i$ is the gray value of the $i$-th pixel in the region, and $\delta$ is the Kronecker delta function, which equals 1 when the pixel value is $k$ and 0 otherwise. Next, histogram equalization is performed based on the computed histogram. First, the cumulative distribution function (CDF) is calculated as the cumulative sum of the histogram bins:
CDF(k) = \sum_{j=0}^{k} H_j
(3)
CLAHE introduces a contrast-limiting factor $L$, which serves to limit the maximum height of the histogram in each block and thus avoid excessive contrast enhancement due to the high frequency of certain gray values. When the frequency $H_k$ of a certain gray value exceeds the set threshold $L$, $H_k$ is directly truncated to $L$:
H_k = \min(H_k, L)
(4)
During histogram equalization, CLAHE uses a modified cumulative distribution function (CDF) to redistribute pixel values. For each pixel with an intensity value $I_i$, the new equalized pixel value is computed using the following equation:
I_i' = \operatorname{round}\!\left( \frac{CDF(I_i) - CDF_{\min}}{CDF_{\max} - CDF_{\min}} \times (G - 1) \right)
(5)
Dividing the image into multiple blocks for local equalization may lead to unnatural transitions at block boundaries. To mitigate this issue, CLAHE applies bilinear interpolation to smooth the edges between adjacent blocks, thereby enhancing local contrast continuity and overall image naturalness.
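As a concrete illustration of this preprocessing stage, the following minimal sketch applies CLAHE followed by the bilateral filtering mentioned in Section 3.1 using OpenCV; the clip limit, tile grid size, and filter parameters are illustrative assumptions rather than values reported in this paper.

```python
import cv2

def enhance_infrared(frame_gray):
    """Enhance a single-channel 8-bit infrared frame: CLAHE for local contrast,
    then bilateral filtering for edge-preserving denoising."""
    # clipLimit plays the role of the contrast-limiting factor L described above
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(frame_gray)
    # Bilateral filter suppresses noise while preserving plume edges
    return cv2.bilateralFilter(enhanced, d=5, sigmaColor=50, sigmaSpace=50)
```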

3.1.2. Motion Foreground Extraction

The traditional Gaussian Mixture Model (GMM) is commonly used for background modeling. It is based on the fundamental assumption that the value of each pixel can be represented as a weighted sum of multiple Gaussian distributions. The background modeling formula is given as follows:
P(X_t) = \sum_{k=1}^{K} w_k \, \mathcal{N}\!\left( X_t \mid \mu_k, \sigma_k^2 \right)
where $K$ denotes the number of Gaussian distributions, $w_k$ represents the weight of the $k$-th Gaussian component, and $\mu_k$ and $\sigma_k^2$ denote its mean and variance, respectively. A pixel observation $X_t$ is matched to the background when it lies sufficiently close to at least one Gaussian component:
\min_k \frac{\left| X_t - \mu_k \right|}{\sigma_k} < T_k
where $T_k$ is the matching threshold; pixels that fail this test are classified as foreground.
To overcome the limitations of the GMM in dynamic scenes, the ViBe algorithm is introduced for background updating. Unlike GMM, ViBe eliminates the need for explicit parameter estimation by employing random replacement of historical samples—a mechanism that enables faster adaptation to background changes and produces more accurate foreground masks—thereby effectively separating the infrared SF6 gas from the background and improving the reliability of subsequent detection.
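For readers who want a starting point, the sketch below uses OpenCV's GMM-based background subtractor (MOG2) as a stand-in for the GMM stage; the ViBe-style random sample replacement and spatial diffusion described above are not part of OpenCV and would require a custom implementation on top of this.

```python
import cv2
import numpy as np

# GMM background model; varThreshold loosely corresponds to the matching threshold T
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=20, detectShadows=False)

def extract_foreground(ir_frame: np.ndarray) -> np.ndarray:
    """ir_frame: CLAHE-enhanced, bilaterally filtered 8-bit infrared frame."""
    mask = subtractor.apply(ir_frame)                                         # raw foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))  # remove speckle noise
    return mask
```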

3.2. P3D-CE

The P3D-CE network is a dual-input object detection framework developed for SF6 gas feature extraction, fusion, and detection. Its architecture is illustrated in Figure 3.

3.2.1. Feature Fusion Module

Efficient fusion of RGB video frames and foreground masks is essential for accurate SF6 leakage detection. Conventional fusion strategies, such as simple channel concatenation or weighted summation, fail to fully leverage the complementary characteristics of the two modalities. To overcome this limitation, we introduce a cross-modal attention mechanism integrated with a fusion gating module—this combination adaptively captures inter-modal correlations and dynamically adjusts feature weights, thereby improving both the accuracy and robustness of feature representation.
The cross-modal attention mechanism is designed to capture the complementary information between RGB and foreground features. By computing attention weights for both modalities, it enables adaptive information interaction. The cross-modal attention weights are calculated as follows:
\alpha_{rgb \rightarrow mask} = \frac{Q_{rgb} K_{mask}^{T}}{\sqrt{d_k}}, \qquad \alpha_{mask \rightarrow rgb} = \frac{Q_{mask} K_{rgb}^{T}}{\sqrt{d_k}}
where $d_k$ denotes the dimension of the key vector, $Q$ represents the query vector, $K$ stands for the key vector, and $V$ indicates the value vector. By applying the softmax function to normalize the attention weights and using them in a weighted sum over the value vectors, the enhanced feature representations $F_{rgb}^{att}$ and $F_{mask}^{att}$ are obtained.
F_{rgb}^{att} = \operatorname{softmax}(\alpha_{rgb \rightarrow mask})\, V_{mask} + F_{rgb}, \qquad F_{mask}^{att} = \operatorname{softmax}(\alpha_{mask \rightarrow rgb})\, V_{rgb} + F_{mask}
A fusion gating mechanism is introduced to dynamically regulate the relative contributions of the two enhanced features. The gating function adaptively computes the weighting coefficients based on the input features:
g_{rgb} = \sigma\!\left( \operatorname{Linear}\!\left( [\, F_{rgb}^{att};\, F_{mask}^{att} \,] \right) \right), \qquad g_{mask} = 1 - g_{rgb}
The enhanced features are fused using the gating weights to obtain the final fused feature representation:
F_{fusion} = g_{rgb} \cdot F_{rgb}^{att} + g_{mask} \cdot F_{mask}^{att}
3D convolution is applied to the fused features for deep feature integration, resulting in the final feature representation.
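A minimal PyTorch sketch of this fusion step is given below; the projection layers, token layout, and residual placement are assumptions made for illustration and do not reproduce the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of cross-modal attention with a fusion gate, following the equations above."""
    def __init__(self, dim: int):
        super().__init__()
        # Separate query/key/value projections for each modality
        self.q_rgb, self.k_rgb, self.v_rgb = (nn.Linear(dim, dim) for _ in range(3))
        self.q_mask, self.k_mask, self.v_mask = (nn.Linear(dim, dim) for _ in range(3))
        # Gating network over the concatenated enhanced features
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, f_rgb: torch.Tensor, f_mask: torch.Tensor) -> torch.Tensor:
        # f_rgb, f_mask: (B, N, dim) token sequences, e.g., flattened spatio-temporal positions
        d_k = f_rgb.size(-1)
        # Cross-modal attention weights (alpha_{rgb->mask} and alpha_{mask->rgb})
        a_rgb = self.q_rgb(f_rgb) @ self.k_mask(f_mask).transpose(-2, -1) / d_k ** 0.5
        a_mask = self.q_mask(f_mask) @ self.k_rgb(f_rgb).transpose(-2, -1) / d_k ** 0.5
        # Enhanced features with residual connections
        f_rgb_att = torch.softmax(a_rgb, dim=-1) @ self.v_mask(f_mask) + f_rgb
        f_mask_att = torch.softmax(a_mask, dim=-1) @ self.v_rgb(f_rgb) + f_mask
        # Fusion gate: g_mask = 1 - g_rgb
        g_rgb = torch.sigmoid(self.gate(torch.cat([f_rgb_att, f_mask_att], dim=-1)))
        return g_rgb * f_rgb_att + (1.0 - g_rgb) * f_mask_att
```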

3.2.2. Backbone

The backbone network adopts a multi-stage spatiotemporal feature extraction framework built upon the Pseudo-3D ResNet (P3D) architecture. At the end of each stage, an attention module (CE-Net) is integrated to strengthen feature representation capability. The overall structure consists of four stages, each comprising multiple P3D blocks and CE-Net modules. Through hierarchical feature extraction, the network captures multi-scale representations and enhances its ability to focus on key spatial regions and informative channels, thereby improving the precision of SF6 leakage feature modeling.
The P3D module decomposes a standard 3D convolutional kernel of size 3 × 3 × 3 into a 1 × 3 × 3 spatial 2D convolution and a 3 × 1 × 1 temporal 1D convolution, which are used to separately capture spatial and temporal features, as illustrated in Figure 4.
The temporal modeling process is implemented by applying the temporal convolution in a sliding-window manner along the time axis with a receptive field size of 3. This operation aggregates contextual information from the current frame and its adjacent neighbors. Formally, the output feature at position (t, h, w) can be expressed as:
\operatorname{output}[t, h, w] = \sum_{i=-1}^{1} \sum_{j,\,k} \operatorname{input}[t+i,\, h+j,\, w+k] \cdot \operatorname{weight}[i, j, k]
where $i, j, k \in \{-1, 0, 1\}$. Here, $i$ indexes the temporal offset, and $j, k$ index spatial offsets along height and width, respectively. By stacking multiple bottleneck layers, the temporal receptive field is progressively expanded from short subsequences to longer temporal contexts, enabling the backbone to model local motion continuity while preserving temporal integrity through residual connections. This mechanism is particularly well-suited to leakage detection tasks that require sensitivity to subtle and dynamically evolving temporal variations.
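The factorized convolution can be sketched as follows; the serial spatial-then-temporal ordering and channel configuration are assumptions made for illustration (P3D also defines parallel and composed variants).

```python
import torch
import torch.nn as nn

class P3DBlock(nn.Module):
    """Illustrative P3D-style block that factorizes a 3x3x3 kernel into a 1x3x3 spatial
    convolution followed by a 3x1x1 temporal convolution, with a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False)
        self.bn_s = nn.BatchNorm3d(channels)
        self.bn_t = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); the residual connection preserves temporal integrity
        out = self.relu(self.bn_s(self.spatial(x)))
        out = self.bn_t(self.temporal(out))
        return self.relu(out + x)
```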
The CE-Net module integrates the strengths of ECA-Net and CBAM by incorporating the ECA mechanism to enhance inter-channel interactions—this helps highlight key features of leakage regions (e.g., subtle intensity variations) from background noise. The structural diagram of the module is presented in Figure 5.
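A hedged sketch of such a combined channel-spatial attention block, adapted to (B, C, T, H, W) features, is shown below; the kernel sizes and the exact composition of the ECA and CBAM branches are assumptions, not the authors' published configuration.

```python
import torch
import torch.nn as nn

class CENetAttention(nn.Module):
    """ECA-style channel attention followed by a CBAM-style spatial attention map."""
    def __init__(self, channels: int, k_size: int = 3):
        super().__init__()
        self.channel_conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.spatial_conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3), bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # ECA: global average pooling -> 1D conv across channels -> sigmoid scaling
        y = x.mean(dim=(2, 3, 4)).view(b, 1, c)
        ch_att = torch.sigmoid(self.channel_conv(y)).view(b, c, 1, 1, 1)
        x = x * ch_att
        # CBAM-style spatial attention: channel-wise mean and max maps -> conv -> sigmoid
        sp = torch.cat([x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True).values], dim=1)
        sp_att = torch.sigmoid(self.spatial_conv(sp))
        return x * sp_att
```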

3.2.3. Neck

To address the challenges of temporal modeling and multi-scale feature fusion in video-based smoke detection, a neck module is designed, comprising a Feature Pyramid Network (FPN) and a temporal Transformer. This module enhances detection performance through an effective feature fusion strategy and advanced temporal modeling approach.
An improved lateral connection mechanism is introduced for feature fusion, defined as follows:
F_{lateral} = \alpha \cdot \operatorname{Conv}_{1 \times 1 \times 1}(F_{high}) + \beta \cdot F_{low}
Here, $F_{high}$ and $F_{low}$ represent the high-level and low-level features, respectively, while $\alpha$ and $\beta$ are learnable weighting parameters that enhance the dynamic fusion capability between features.
Due to the high computational complexity of traditional self-attention mechanisms in video tasks, this study adopts a more efficient scheme that applies scaled dot-product self-attention along the temporal dimension only, treating the frame-wise features at each spatial location as tokens (see Figure 6):
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V
Multi-scale feature maps with unified channel dimensions are generated through lateral convolution, top-down upsampling, and feature fusion. The two intermediate feature layers output by the FPN are processed by a temporal Transformer for temporal modeling (as shown in Figure 6), thereby enhancing the ability to capture dynamic targets. All multi-scale feature maps are then fed into the FCOS detection head to perform multi-scale object detection.
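The sketch below illustrates the two neck ingredients described above: a lateral connection with learnable weights α and β, and a temporal Transformer that treats the T frame-wise vectors at each spatial location as tokens. Module names and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LearnableLateral(nn.Module):
    """Weighted lateral connection: F_lateral = alpha * Conv1x1x1(F_high) + beta * F_low."""
    def __init__(self, high_channels: int, low_channels: int):
        super().__init__()
        self.proj = nn.Conv3d(high_channels, low_channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.tensor(1.0))
        self.beta = nn.Parameter(torch.tensor(1.0))

    def forward(self, f_high: torch.Tensor, f_low: torch.Tensor) -> torch.Tensor:
        # f_high is assumed to be already upsampled to f_low's spatiotemporal size
        return self.alpha * self.proj(f_high) + self.beta * f_low

class TemporalTransformer(nn.Module):
    """Self-attention over the T temporal tokens at each spatial position of a (B, C, T, H, W) map.
    channels must be divisible by num_heads."""
    def __init__(self, channels: int, num_heads: int = 4, num_layers: int = 1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           dim_feedforward=2 * channels, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)   # one T-token sequence per location
        tokens = self.encoder(tokens)
        return tokens.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)  # back to (B, C, T, H, W)
```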

3.2.4. Head

To accurately localize and detect SF6 gas leakage regions, an FCOS-based detection head is designed following the neck module. For the multi-scale feature maps generated by the temporal feature fusion module (FPN + temporal Transformer), the detection head employs a set of shared convolutional layers and is divided into three branches: classification, regression, and temporal offset, where the temporal offset branch specifically addresses the dynamic nature of leakage diffusion. In bounding box prediction, the regression branch estimates the distances from each location on the feature map to the four sides of the corresponding ground-truth bounding box.
During training, the total loss comprises three components: classification loss, regression loss, and temporal consistency loss. The overall loss function is defined as follows:
L = \lambda_1 L_{cls} + \lambda_2 L_{reg} + \lambda_3 L_{temporal}
Following common practice in anchor-free object detection frameworks, the weights are set to λ 1 = λ 2 = 1 to balance classification and regression. Since the temporal consistency term acts as an auxiliary constraint, its weight is set to λ 3 = 0.5 , which stabilizes training while preventing the temporal loss from dominating the optimization process. This configuration enables the model to simultaneously ensure spatial accuracy, precise localization, and temporal smoothness.
The regression loss adopts the Complete Intersection over Union (CIOU) loss, which is defined as follows:
L_{reg} = 1 - \operatorname{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v
where $b$ and $b^{gt}$ denote the centers of the predicted and ground-truth boxes, $\rho(\cdot)$ is the Euclidean distance between them, $c$ is the diagonal length of the smallest box enclosing both, and $\alpha v$ penalizes aspect-ratio inconsistency.
The temporal consistency loss is defined as:
L_{temporal} = \frac{1}{N} \sum_{i=1}^{N} \left( \left| x_i - x_i^{gt} \right| + \left| y_i - y_i^{gt} \right| \right)
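A minimal sketch of how these terms are combined during training is given below, assuming the classification and CIoU regression losses are computed by the detection head; the function names and tensor layouts are illustrative.

```python
import torch

def temporal_consistency_loss(centers_pred: torch.Tensor, centers_gt: torch.Tensor) -> torch.Tensor:
    """L_temporal: mean L1 distance between predicted and ground-truth box centers.
    centers_*: (N, 2) tensors holding (x, y) coordinates per matched location."""
    return (centers_pred - centers_gt).abs().sum(dim=1).mean()

def total_loss(l_cls: torch.Tensor, l_reg: torch.Tensor, l_temporal: torch.Tensor,
               lam_cls: float = 1.0, lam_reg: float = 1.0, lam_temp: float = 0.5) -> torch.Tensor:
    """Weighted sum L = lambda1*Lcls + lambda2*Lreg + lambda3*Ltemporal with the weights stated in the text."""
    return lam_cls * l_cls + lam_reg * l_reg + lam_temp * l_temporal
```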

4. Experiments

4.1. Dataset Introduction

Since SF6 gas is invisible under visible light, its visualization primarily depends on infrared thermal imaging technology. The dataset used in this study was captured using a FLIR GF306 camera (Teledyne FLIR, LLC; Wilsonville, OR, USA) with a video resolution of 320 × 240. A total of 13 video clips were collected, and foreground images corresponding to the infrared frames were extracted in real time from the video sequences.
The dataset consists of infrared images captured by the FLIR GF306 gas thermal camera and foreground pixel images extracted in real time from video frames. These two types of image data together form the training dataset, providing support for the training and optimization of the VGEC-Net model, as illustrated in Figure 7. The dataset is divided as follows: 4992 training samples, 704 validation samples, and 704 test samples, as detailed in Table 1.

4.2. Implementation Details

The experiments were conducted on a Windows 11 operating system using Python 3.9, implemented based on the PyTorch 1.12.0 framework. Model training was performed on an NVIDIA RTX 4060Ti GPU. A fixed random seed of 42 was used for reproducibility. The initial value of the background parameter T was configured as 20, with a batch size of 32 and a weight decay coefficient of 0.0001. The Adam optimizer was selected with an initial learning rate of 0.001, and a cosine annealing schedule was applied to dynamically adjust the learning rate. The training lasted for 100 epochs with early stopping applied based on validation mAP, using a patience of 20 epochs. Data augmentation strategies included random horizontal flipping, brightness/contrast adjustment, and Gaussian noise injection to improve robustness. Since the detection head was FCOS, the model followed an anchor-free setting without predefined anchor boxes. To ensure compatibility with convolutional backbones such as P3D-ResNet and to enable fair comparison with other baseline methods, all frames were resized from their original resolution of 320 × 240 to 224 × 224 before being fed into the network.
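The optimizer and schedule described here can be set up as in the following sketch; `VGECNet` and `train_one_epoch` are hypothetical placeholders, since the paper does not publish code.

```python
import random
import numpy as np
import torch

seed = 42  # fixed seed for reproducibility, as stated above
random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)

model = VGECNet()  # hypothetical model class name
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # cosine annealing over 100 epochs

for epoch in range(100):
    train_one_epoch(model, optimizer)  # placeholder for the actual training loop
    scheduler.step()
    # early stopping on validation mAP (patience = 20 epochs) would be checked here
```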
For temporal modeling, each input sample was constructed as a video clip of T = 16 consecutive frames. A sliding-window sampling strategy with stride = 1 was adopted during both training and inference, ensuring consistent use of consecutive frames. The resulting tensor has the shape (B, C, T, H, W), where C = 4 (pseudo-RGB infrared channels + foreground mask), T = 16, and H = W = 224. In practice, we used a batch size of 16 to balance memory consumption and computational efficiency on the RTX 4060Ti GPU. This configuration provides a longer temporal context for capturing leakage dynamics, while still maintaining practical inference speed.
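The clip construction can be sketched as a simple sliding-window dataset; the class and field names are assumptions, and frames and masks are expected to be pre-resized to 224 × 224.

```python
import torch
from torch.utils.data import Dataset

class SF6ClipDataset(Dataset):
    """Sliding-window clip sampler (stride 1, T = 16) producing (C, T, H, W) tensors
    with C = 4: pseudo-RGB infrared channels plus the foreground mask."""
    def __init__(self, frames: torch.Tensor, masks: torch.Tensor, clip_len: int = 16):
        # frames: (N, 1, 224, 224) infrared frames; masks: (N, 1, 224, 224) foreground masks
        self.frames, self.masks, self.clip_len = frames, masks, clip_len

    def __len__(self) -> int:
        return self.frames.shape[0] - self.clip_len + 1

    def __getitem__(self, idx: int) -> torch.Tensor:
        ir = self.frames[idx: idx + self.clip_len]             # (T, 1, H, W)
        fg = self.masks[idx: idx + self.clip_len]              # (T, 1, H, W)
        clip = torch.cat([ir.repeat(1, 3, 1, 1), fg], dim=1)   # (T, 4, H, W): pseudo-RGB + mask
        return clip.permute(1, 0, 2, 3)                        # (C, T, H, W) as expected by Conv3d
```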

4.3. Evaluation Metrics

To comprehensively evaluate the performance of the object detection model, mean Average Precision (mAP) was selected as the primary evaluation metric, with supplementary detailed analysis conducted across different Intersection over Union (IoU) thresholds and object size categories. In object detection, mAP quantifies a model’s overall performance by calculating the average precision across a series of IoU thresholds—typically ranging from 0.5 to 0.95 at 0.05 intervals. IoU itself is a core metric for measuring the accuracy of bounding box predictions, defined as the ratio of the overlapping area between a predicted bounding box and its corresponding ground truth box to the area of their union. By computing mAP across multiple IoU thresholds, the model’s adaptability under varying detection precision requirements is fully captured, laying a reliable foundation for objective performance assessment. Additionally, this study integrated mean Average Precision for small objects (mAPs) into the evaluation framework as a specialized metric to address the unique challenges of detecting small targets. For the purpose of this research, “small objects” are defined specifically as SF6 leakage targets whose bounding box area falls between 1 and 1024 pixels on standardized 224 × 224 input frames.
To further characterize detection errors, two error-oriented indicators are introduced: the False Alarm Rate (FAR) and the Miss Alarm Rate (MAR), both calculated at IoU = 0.50. Given the numbers of true positives (TP), false positives (FP), and false negatives (FN), they are defined as:
\operatorname{FAR} = \frac{FP}{TP + FP}, \qquad \operatorname{MAR} = \frac{FN}{TP + FN}
FAR quantifies the proportion of false detections among all predictions, reflecting the risk of over-reporting, while MAR measures the proportion of missed ground-truth instances, reflecting the risk of under-reporting. Together with mAP, these complementary metrics provide a more comprehensive assessment of model performance.
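For completeness, a small helper computing both rates from matched detection counts might look like this; how TP, FP, and FN are counted (matching of predictions to ground truth at IoU = 0.50) is assumed to be handled by the evaluation pipeline.

```python
def false_and_miss_alarm_rates(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """FAR = FP / (TP + FP); MAR = FN / (TP + FN), both evaluated at IoU = 0.50."""
    far = fp / (tp + fp) if (tp + fp) > 0 else 0.0
    mar = fn / (tp + fn) if (tp + fn) > 0 else 0.0
    return far, mar
```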
To ensure statistical reliability, all metrics (mAP, mAP50, mAPs, FAR, and MAR) are reported as the mean ± standard deviation over three independent runs with different random seeds. The low variance observed across runs (<0.5%) confirms the stability and reproducibility of the evaluation results.

4.4. Algorithm Comparison

In the performance evaluation framework for object detection algorithms, mean Average Precision (mAP) and model parameter count are core indicators for assessing algorithm effectiveness and practical applicability. This study focuses on comparing the performance of VGEC-Net with that of mainstream video object detection algorithms, conducting in-depth analysis from two dimensions—detection accuracy and model compactness—with detailed results presented in Table 2. Compared to other benchmark methods, VGEC-Net achieves significant performance improvements across the mAP, mAP50, and mAPs metrics: it exhibits a 61.2% increase in overall mAP relative to YOLOv5s, and more notably, a 152.9% gain in small object detection (mAPs). This demonstrates VGEC-Net’s superior capability to recognize multi-scale targets in complex scenes. As illustrated in Figure 8, VGEC-Net achieves substantial accuracy gains while maintaining a compact model size of 27.5 million parameters. Compared to heavyweight algorithms, it significantly reduces computational resource consumption without sacrificing detection performance, striking a favorable balance between accuracy and efficiency. This balance provides a robust foundation for the effective deployment of VGEC-Net in real-world application scenarios.
For clarity, the term “online” in Table 2 is defined to refer to detection methods that operate frame-by-frame or rely on short-term temporal modeling—approaches suitable for real-time deployment (e.g., the YOLO series, 3DVSD, Faster R-CNN, and VGEC-Net). By contrast, “offline” methods (e.g., MEGA) require access to multiple frames from both the past and future to perform sequence-level feature aggregation. While this offline strategy can enhance detection accuracy, it is not applicable to real-time scenarios due to its high latency.
To evaluate the performance of the VGEC-Net model, comparative experiments were conducted on the test set against other representative models. The results are shown in Figure 9 and Table 3.

4.5. Ablation Experiments

The impact of the VGEC-Net algorithm on model performance was investigated from multiple experimental perspectives, with a primary focus on three aspects: model components, input channels, and attention mechanisms.

4.5.1. Analysis of Model Components

To verify the effectiveness of each improved module, a series of ablation experiments were conducted based on the baseline P3D model. Specifically, three model variants were constructed by sequentially integrating the Feature Pyramid Network (FPN) module and the temporal Transformer; the performance of these variants was then evaluated systematically. Detailed results are presented in Table 4.
The baseline P3D model demonstrates limited performance in both overall accuracy (mAP) and small object detection (mAPs), achieving only 22.6% and 13.4%, respectively. This result indicates that the baseline model lacks sufficient capability to model complex scenes and accurately detect small-scale SF6 leakage targets—consistent with the research gap identified earlier regarding small-target detection in leakage scenarios.
After integrating the FPN module into the baseline, the model’s performance exhibits a substantial improvement: mAP increases significantly to 43.7%, while mAPs rise to 32.1%. This marked gain confirms the effectiveness of FPN in facilitating multi-scale feature fusion, as the module enhances the model’s ability to capture fine-grained features of small objects—addressing the baseline’s weakness in small-target detection.
Building on this optimized framework, the further integration of the temporal Transformer module yields additional performance improvements: mAP reaches 45.3%, and mAP50 (mAP at IoU = 0.5) climbs to 67.3%. This outcome suggests that the temporal Transformer contributes positively to detection robustness and accuracy by modeling temporal correlations between consecutive video frames—enabling the model to better capture dynamic changes in leakage targets over time. Although the performance gain from the temporal Transformer is less pronounced than that from FPN, the module adds unique value in capturing temporal dynamics, which is critical for video-based SF6 leakage detection.
In summary, the final model constructed by integrating all three components achieves a significant improvement in performance for complex scenes and small object detection, while maintaining a moderate increase in parameter size. This verifies the effectiveness and practical applicability of the proposed architecture in infrared gas leakage detection tasks.

4.5.2. Attention Mechanism Analysis

This study proposes a novel attention mechanism, CE-Net, for the backbone network design. CE-Net integrates the lightweight cross-channel interaction of the Efficient Channel Attention (ECA-Net) module with the dual spatial-channel attention strategy of CBAM, enabling more refined and multi-level feature enhancement. Experimental results demonstrate that CE-Net significantly improves detection performance compared to the original ECA-Net and CBAM modules, with mAP50 increasing by 10.4 and 3.8 percentage points, respectively. Detailed results and comparisons are presented in Table 5.

4.5.3. Analysis of Input Channels

To further validate the effectiveness of the dual-input strategy in SF6 leakage detection, an ablation study was conducted on the input configuration of the P3D-CE network. Three settings were evaluated: using only the foreground mask, using only the infrared image, and using both as dual inputs. As shown in Table 6, the dual-input configuration outperforms the single-input variants across all metrics (mAP, mAP50, and mAPs), demonstrating the complementary nature of the two input modalities and their combined benefit in enhancing detection performance.

4.6. Cross-Video Validation for Generalization

To further evaluate the generalization capability of VGEC-Net, a Leave-P-Video-Out (LPVO) cross-validation experiment was conducted. Unlike the Leave-One-Video-Out (LOVO) strategy—where a single video is held out as the test set in each fold—the 13 collected videos were partitioned into three folds, with each fold containing 4–5 complete videos. In each validation round, one fold was reserved exclusively for testing, while the remaining two folds were merged as the training set. This experimental protocol ensures that all frames from the same video are assigned to either the training set or the testing set (but not both), thereby eliminating temporal information leakage (a critical confounding factor in video-based detection) and enabling a fair assessment of the model’s generalization performance to unseen video data.
The experiment adopted the same training configurations as detailed in Section 4.2. Performance was quantified using the evaluation metrics defined in Section 4.3, and the results are summarized in Table 7. As shown in Table 7, both the per-fold performance values and the aggregated statistics (mean ± standard deviation) across all three folds are reported.
These results confirm that VGEC-Net not only achieves superior detection accuracy under a fixed split but also maintains reliable performance across different video subsets. This indicates that the model can effectively generalize to previously unseen infrared video sequences, despite the limited size and diversity of the dataset.

5. Conclusions

(1)
To address the issue of insufficient detection accuracy of SF6 gas leakage in infrared video, this paper proposes VGEC-Net, a real-time detection model that integrates multi-channel inputs and temporal modeling. Built upon the P3D backbone, the model incorporates the CE-Net attention mechanism, FPN structure, and a temporal Transformer module to efficiently model dynamic and subtle leakage patterns, enhancing the spatiotemporal representation capability in complex scenarios.
(2)
Experimental results on the self-constructed SF6 leakage detection dataset show that VGEC-Net yields higher mAP and mAP50 compared with existing methods such as YOLOv8s and 3DVSD. In addition, it provides better performance in small-object detection (mAPs) and model compactness, achieving a good trade-off between detection accuracy and computational efficiency.
(3)
Further evaluations show that VGEC-Net maintains strong robustness and generalization ability when applied to infrared videos with blurred backgrounds and complex environmental interference. This approach offers promising insights and technical support for intelligent detection tasks involving other industrial gases or infrared targets.
For future work, we plan to design lightweight spatiotemporal attention modules and apply knowledge distillation to reduce computational costs for edge deployment. We will also extend the dataset to more complex scenarios and additional industrial gases beyond SF6, to improve generalization. Furthermore, we will explore integrating our method with drone-based infrared platforms and edge hardware for large-scale, autonomous industrial environmental surveillance.

Author Contributions

Writing—original draft, Z.L.; Methodology, X.Z.; Data curation, Z.L.; Funding acquisition, X.Z. and Z.X.; Writing—review and editing, Z.X. and Y.L.; Validation, F.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Natural Science Project of Henan Province (Grant No. 202300410117), in part by the China Postdoctoral Science Foundation (Grant No. 2022M712382), and in part by the Natural Science Foundation of Henan Province (Grant No. 242102240129).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data are available upon request owing to privacy restrictions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Xu, C.; Zhou, T.; Chen, X.; Li, X.; Kang, C. Estimating of sulfur hexafluoride gas emission from electric equipments. In Proceedings of the 2011 1st International Conference on Electric Power Equipment-Switching Technology, Xi’an, China, 23–27 October 2011; pp. 299–303. [Google Scholar]
  2. Yang, L.; Wang, S.; Chen, C.; Zhang, Q.; Sultana, R.; Han, Y. Monitoring and Leak Diagnostics of Sulfur Hexafluoride and Decomposition Gases from Power Equipment for the Reliability and Safety of Power Grid Operation. Appl. Sci. 2024, 14, 3844. [Google Scholar] [CrossRef]
  3. Masson-Delmotte, V.; Zhai, P.; Pirani, A.; Connors, S.L.; Péan, C.; Berger, S.; Caud, N.; Chen, Y.; Goldfarb, L.; Zhou, B.; et al. Climate Change 2021: The Physical Science Basis. Contribution of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: Cambridge, UK, 2021; Volume 2, 2391p. [Google Scholar]
  4. GOV.UK. How to Operate or Service Electrical Switchgear Containing SF6 [Guidance]. 2014. Available online: https://www.gov.uk/guidance/how-to-operate-or-service-high-voltage-switchgear-containing-sf6 (accessed on 15 May 2025).
  5. Lu, Q.; Li, Q.; Hu, L.; Huang, L. An effective Low-Contrast SF6 gas leakage detection method for infrared imaging. IEEE Trans. Instrum. Meas. 2021, 70, 5009009. [Google Scholar] [CrossRef]
  6. Wang, Y.; Yao, Y.; Zhao, R.; Zhang, Z.; Jing, R. SF6 Research on the Key Technology of the Gas Integrated Online Monitoring System in the Fault Early Warning and Diagnosis of GIS Equipment. In Proceedings of the 2024 Boao New Power System International Forum-Power System and New Energy Technology Innovation Forum (NPSIF), Qionghai, China, 8–10 December 2024; pp. 163–169. [Google Scholar]
  7. Zheng, K.; Luo, W.; Duan, L.; Zhao, S.; Jiang, S.; Bao, H.; Ho, H.L.; Zheng, C.; Zhang, Y.; Ye, W.; et al. High sensitivity and stability cavity-enhanced photoacoustic spectroscopy with dual-locking scheme. Sens. Actuators B Chem. 2024, 415, 135984. [Google Scholar] [CrossRef]
  8. Yun, Y.X.; Chen, W.G.; Sun, C.X.; Pang, C. Photoacoustic spectroscopy detection method for methane gas in transformer oil. Proc. Chin. Soc. Electr. Eng. 2008, 28, 40–46. [Google Scholar]
  9. Yang, Z.H.; Zhang, Y.K.; Chen, Y.; Li, X.F.; Jiang, Y.; Feng, Z.Z.; Deng, B.; Chen, C.-l.; Zhou, D.F. Simultaneous detection of multiple gaseous pollutants using multi-wavelength differential absorption LIDAR. Opt. Commun. 2022, 518, 128359. [Google Scholar] [CrossRef]
  10. Shen, Y.; Shao, K.M.; Wu, J.; Huang, F.; Guo, Y. Research progress on gas optical detection technology and its application. Opto-Electron. Eng. 2020, 47, 3–18. [Google Scholar]
  11. Wu, H.; Chen, Y.; Lin, W.; Wang, F. Novel signal denoising approach for acoustic leak detection. J. Pipeline Syst. Eng. Pract. 2018, 9, 04018016. [Google Scholar] [CrossRef]
  12. Wu, S.Q.; Shen, B.; Xiong, G.; Xu, W. Detection and analysis of photoacoustic signals radiated by gas plasma. Laser Infrared 2017, 47, 428–431. [Google Scholar]
  13. Li, Y.; Zhang, Y.; Geng, A.; Cao, L.; Chen, J. Infrared image enhancement based on atmospheric scattering model and histogram equalization. Opt. Laser Technol. 2016, 83, 99–107. [Google Scholar] [CrossRef]
  14. Zhang, F.; Hu, H.; Wang, Y. Infrared image enhancement based on adaptive non-local filter and local contrast. Optik 2023, 292, 171407. [Google Scholar] [CrossRef]
  15. Zhang, C.J.; Fu, M.Y.; Jin, M. Resistible noise approach of infrared image contrast enhancement. Infrared Laser Eng. 2004, 33, 50–54. [Google Scholar]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  17. Liu, Y.; Zhou, T.; Xu, J.; Hong, Y.; Pu, Q.; Wen, X. Rotating target detection method of concrete bridge crack based on YOLO v5. Appl. Sci. 2023, 13, 11118. [Google Scholar] [CrossRef]
  18. Xu, S.; Wang, X.; Sun, Q.; Dong, K. MWIRGas-YOLO: Gas leakage detection based on mid-wave infrared imaging. Sensors 2024, 24, 4345. [Google Scholar] [CrossRef] [PubMed]
  19. Xu, C.J.; Wang, X.F.; Yang, Y.D. Attention-YOLO: YOLO detection algorithm with attention mechanism. Comput. Eng. Appl. 2019, 55, 13–23. [Google Scholar]
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multi-box detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  21. Yang, S.; Chen, Z.; Ma, X.; Zong, X.; Feng, Z. Real-time high-precision pedestrian tracking: A detection–tracking–correction strategy based on improved SSD and Cascade R-CNN. J. Real-Time Image Process. 2022, 19, 287–302. [Google Scholar] [CrossRef]
  22. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  23. Zhang, R.M.; Jia, Z.N.; Li, J.X.; Wu, L.; Xu, X.; Yuan, B. Improved EfficientDet remote sensing object detection algorithm based on multi-receptive field feature enhancement. Electron. Opt. Control 2024, 31, 53–60+96. [Google Scholar]
  24. Liu, B.; Zhao, W.; Sun, Q. Study of object detection based on Faster R-CNN. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 6233–6236. [Google Scholar]
  25. Ren, Z.J.; Lin, S.Z.; Li, D.W.; Wang, L.; Zuo, J. Mask R—CNN object detection method based on improved feature pyramid. Laser Optoelectron. Prog. 2019, 56, 174–179. [Google Scholar]
  26. Sahin, M.E.; Ulutas, H.; Yuce, E.; Erkoc, M.F. Detection and classification of COVID-19 by using faster R-CNN and mask R-CNN on CT images. Neural Comput. Appl. 2023, 35, 13597–13611. [Google Scholar] [CrossRef]
  27. Huo, Y.; Zhang, Q.; Zhang, Y.; Zhu, J.; Wang, J. 3DVSD: An end-to-end 3D convolutional object detection network for video smoke detection. Fire Saf. J. 2022, 134, 103690. [Google Scholar] [CrossRef]
  28. Chen, Y.; Cao, Y.; Hu, H.; Wang, L. Memory enhanced global-local aggregation for video object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10337–10346. [Google Scholar]
  29. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  30. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  31. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  32. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Global context networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 6881–6895. [Google Scholar] [CrossRef]
  33. Hu, J.; Wang, H.; Wang, J.; Wang, Y.; He, F.; Zhang, J. SA-Net: A scale-attention network for medical image segmentation. PLoS ONE 2021, 16, e0247388. [Google Scholar] [CrossRef]
  34. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  35. Fu, G.D.; Huang, J.; Yang, T.; Zheng, S. Lightweight attention model with improved CBAM. Comput. Eng. Appl. 2021, 57, 150–156. [Google Scholar]
  36. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–29 June 2019; pp. 3146–3154. [Google Scholar]
  37. Park, S.; Kim, J.; Kim, J.; Wang, S. Fault Diagnosis of Air Handling Units in an Auditorium Using Real Operational Labeled Data across Different Operation Modes. J. Comput. Civ. Eng. 2025, 39, 04025065. [Google Scholar] [CrossRef]
  38. Wang, S. Automated non-PPE detection on construction sites using YOLOv10 and transformer architectures for surveillance and body worn cameras with benchmark datasets. Sci. Rep. 2025, 15, 27043. [Google Scholar] [CrossRef]
Figure 1. Overall architecture of VGEC-Net (arrows indicate the direction of data flow from preprocessing to the detection head, illustrating the sequential processing of infrared images and foreground masks).
Figure 2. Motion foreground extraction.
Figure 3. P3D-CE architecture (arrows explicitly denote the flow of features between the fusion module, backbone blocks, and attention mechanisms, clarifying how information is propagated through the network).
Figure 4. Schematic of the P3D module (S: Spatial convolution; T: Temporal convolution).
Figure 5. Structure of the CE-Net module (arrows represent the information flow across channel and spatial attention components, highlighting the sequential operations within the module).
Figure 6. Structure of the Temporal Transformer. (The FPN outputs retain the temporal dimension, with feature maps shaped (B, C_l, T, H_l, W_l), e.g., (B, 64, 16, 56, 56) in our implementation. For each spatial location (h, w), the T feature vectors across time are treated as temporal tokens and processed by multi-head self-attention, enabling the model to capture long-range dependencies across the 16-frame clip.)
Figure 7. Samples from the labeled training set: (a,c,e,g): infrared images; (b,d,f,h): foreground images extracted in real time.
Figure 8. Comparison of VGEC-Net with mainstream detection models on the validation set: (a) Parameters vs. mAP; (b) Parameters vs. mAP50; (c) Parameters vs. mAPs; (d) mAP over Training Epochs.
Figure 9. Detection results of VGEC-Net and several mainstream detection models on the test set.
Table 1. Number of Samples in the Training, Validation, and Test Sets (INF: infrared image; FPI: foreground pixel image).
| Split | INF | FPI |
|---|---|---|
| Training Set | 4992 | 4992 |
| Validation Set | 704 | 704 |
| Test Set | 704 | 704 |
Table 2. Comparison of detection performance on the validation set in terms of mAP, mAP50, mAPs, FAR, and MAR. All results are reported as mean ± standard deviation over three independent runs with different random seeds (Params: number of parameters; FPS: frames per second).
| Method | Detection | Input Size | mAP (%) | mAP50 (%) | mAPs (%) | FAR (%) | MAR (%) | Params (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv5s | Online | 224 × 224 | 37.1 ± 0.2 | 65.3 ± 0.2 | 15.7 ± 0.3 | 8.7 ± 0.2 | 28.1 ± 0.2 | 8.7 | 147.8 |
| YOLOv5n | Online | 224 × 224 | 40.3 ± 0.4 | 60.8 ± 0.2 | 11.9 ± 0.3 | 9.4 ± 0.2 | 9.4 ± 0.3 | 2.9 | 137.5 |
| YOLOv8n | Online | 224 × 224 | 39.8 ± 0.3 | 67.4 ± 0.2 | 9.7 ± 0.2 | 8.4 ± 0.1 | 8.4 ± 0.2 | 4.2 | 90.6 |
| YOLOv8s | Online | 224 × 224 | 43.9 ± 0.2 | 69.9 ± 0.3 | 7.3 ± 0.3 | 7.8 ± 0.1 | 26.7 ± 0.2 | 14.3 | 79.4 |
| 3DVSD | Online | 224 × 224 | 50.7 ± 0.2 | 81.6 ± 0.1 | 25.9 ± 0.3 | 5.8 ± 0.1 | 22.4 ± 0.1 | 55.9 | 59.7 |
| Faster-RCNN | Online | 224 × 224 | 37.4 ± 0.5 | 70.9 ± 0.5 | 19.4 ± 0.4 | 7.5 ± 0.3 | 26.7 ± 0.2 | 45.6 | 18.2 |
| MEGA | Offline | 224 × 224 | 47.4 ± 0.3 | 78.4 ± 0.2 | 28.8 ± 0.3 | 6.3 ± 0.1 | 23.6 ± 0.2 | 59.8 | 13.4 |
| VGEC-Net | Online | 224 × 224 | 59.8 ± 0.1 | 89.7 ± 0.3 | 39.7 ± 0.3 | 4.1 ± 0.1 | 18.2 ± 0.2 | 27.5 | 77.4 |
Table 3. Comparison of mAP50, FAR, and MAR between VGEC-Net and other models on the test set.
| Method | mAP (%) | mAP50 (%) | FAR (%) | MAR (%) | FPS |
|---|---|---|---|---|---|
| YOLOv5s | 34.7 | 63.8 | 0.031 | 0.689 | 153.7 |
| YOLOv5n | 38.6 | 56.4 | 0.083 | 0.391 | 137.9 |
| YOLOv8n | 46.7 | 67.9 | 0.095 | 0.310 | 90.2 |
| YOLOv8s | 47.1 | 68.4 | 0.058 | 0.388 | 87.4 |
| 3DVSD | 53.7 | 80.7 | 0.051 | 0.235 | 57.9 |
| Faster-RCNN | 30.9 | 65.5 | 0.313 | 0.303 | 20.8 |
| MEGA | 46.3 | 78.6 | 0.041 | 0.211 | 12.8 |
| VGEC-Net | 61.7 | 87.3 | 0.035 | 0.158 | 78.2 |
Table 4. Analysis of Model Components.
| Model | mAP (%) | mAP50 (%) | mAPs (%) | Params (M) |
|---|---|---|---|---|
| P3D | 22.6 | 54.7 | 13.4 | 18.8 |
| P3D + FPN | 43.7 | 65.7 | 32.1 | 20.1 |
| P3D + FPN + Transformer | 45.3 | 67.3 | 32.3 | 20.8 |
Table 5. Performance Comparison of Different Attention Mechanisms.
| Attention Mechanism | mAP (%) | mAP50 (%) | mAPs (%) | Params (K) |
|---|---|---|---|---|
| ECA-Net | 54.4 | 79.3 | 35.8 | 3.4 |
| CBAM | 58.7 | 85.9 | 37.9 | 8 |
| CE-Net | 59.8 | 89.7 | 39.7 | 9.6 |
Table 6. Impact of Different Input Channel Configurations on SF6 Leakage Detection Performance.
| INF | FPI | mAP (%) | mAP50 (%) | mAPs (%) |
|---|---|---|---|---|
|  | ✓ | 13.5 | 40.9 | 9.3 |
| ✓ |  | 46.3 | 67.3 | 33.3 |
| ✓ | ✓ | 47.6 | 69.3 | 34.1 |
Table 7. Performance of VGEC-Net Across Different Held-Out Video Folds in LPVO Experiment.
| Fold (Held-Out Videos) | mAP (%) | mAP50 (%) | mAPs (%) | FAR (%) | MAR (%) |
|---|---|---|---|---|---|
| F1 (V1–V4) | 65.4 | 94.1 | 43.7 | 2.7 | 6.2 |
| F2 (V5–V8) | 61.5 | 89.4 | 40.3 | 3.6 | 9.1 |
| F3 (V9–V13) | 54.0 | 81.4 | 36.4 | 4.8 | 13.5 |
| Mean ± std | 60.3 ± 4.7 | 88.3 ± 5.2 | 40.1 ± 3.0 | 3.7 ± 0.9 | 9.6 ± 3.0 |