Article

SFMattingNet: A Trimap-Free Deep Image Matting Approach for Smoke and Fire Scenes

1 School of Information Engineering, China University of Geosciences, Beijing 100083, China
2 Department of Paediatrics, University of Cambridge, Cambridge CB2 0QQ, UK
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(13), 2259; https://doi.org/10.3390/rs17132259
Submission received: 29 April 2025 / Revised: 9 June 2025 / Accepted: 23 June 2025 / Published: 1 July 2025
(This article belongs to the Special Issue Advanced AI Technology for Remote Sensing Analysis)

Abstract

Smoke and fire detection is vital for timely fire alarms, but traditional sensor-based methods are often unresponsive and costly. While deep learning-based methods offer promise using aerial images and surveillance images, the scarcity and limited diversity of smoke-and-fire-related image data hinder model accuracy and generalization. Alpha composition, blending foreground and background using per-pixel alpha values (transparency parameters stored in the alpha channel alongside RGB channels), can effectively augment smoke and fire image datasets. Since image matting algorithms compute these alpha values, the quality of the alpha composition directly depends on the performance of the smoke and fire matting methods. However, due to the lack of smoke and fire image matting datasets for model training, existing image matting methods exhibit significant errors in predicting the alpha values of smoke and fire targets, leading to unrealistic composite images. Therefore, to address the above issues, the main research contributions of this paper are as follows: (1) Construction of a high-precision, large-scale smoke and fire image matting dataset, SFMatting-800. The images in this dataset are sourced from diverse real-world scenarios. It provides precise foreground opacity values and attribute annotations. (2) Evaluation of existing image matting baseline methods. Based on the SFMatting-800 dataset, traditional, trimap-based deep learning, and trimap-free deep learning matting methods are evaluated to identify their strengths and weaknesses, providing a benchmark for improving future smoke and fire matting methods. (3) Proposal of a deep learning-based trimap-free smoke and fire image matting network, SFMattingNet, which takes the original image as input without using trimaps. Taking into account the unique characteristics of smoke and fire, the network incorporates a non-rigid object feature extraction module and a spatial awareness module, achieving improved performance. Compared to the second-best approach, MODNet, our SFMattingNet method achieved an average error reduction of 12.65% in the smoke and fire matting task.

1. Introduction

Fire remains one of the most significant threats to both human life and property, necessitating continuous research into its timely detection and mitigation [1]. First, fire often causes substantial property damage. It can destroy buildings, vehicles, and other assets, leading to incalculable economic losses for individuals and businesses. Second, fire poses a severe threat to human safety. The heat, smoke, and toxic gases generated by fires can result in poisoning, suffocation, burns, and even fatalities. Furthermore, fire has adverse effects on the environment. The release of carbon dioxide and other greenhouse gases exacerbates global climate change by intensifying the greenhouse effect. Early fire detection and prompt response are critical in minimizing harm to property, lives, and the environment. Therefore, research on effective smoke and fire detection methods is crucial in enabling timely identification, particularly through aerial imagery. Such advancements are vital for applications like forest fire prevention, supporting early hazard reporting and enhancing emergency response capabilities.
Traditional smoke and fire detection methods primarily rely on sensors, such as temperature, gas, smoke, and light sensors, to detect changes in the physical and chemical parameters caused by fire incidents [2]. These sensors have been widely adopted due to their simple design and effective detection capabilities within a limited range. However, they have notable limitations: (1) They only detect fires when flames reach their immediate vicinity, delaying response and allowing fires to escalate. (2) They incur high costs, as they require frequent calibration and maintenance. Thus, these limitations make sensor-based smoke and fire detection systems expensive and prone to delayed fire alarms.
In recent years, large-scale image data, enabled by satellite devices and aircraft deployment, has provided developers with a key opportunity to detect smoke and fire targets. Researchers can leverage the rapid advancements in deep learning-based image processing techniques to design more advanced and effective smoke and fire detection methods. Compared with traditional sensor-based smoke and fire detection methods, deep learning-based detection methods [3,4,5] offer the following advantages. (1) Wider detection range: these methods can detect fires at close range and from long distances. (2) Lower cost: existing surveillance cameras and drones can be utilized, eliminating the need for smoke and fire sensors, thus reducing costs. (3) Better visualization: with the proliferation of high-definition cameras, these methods can provide high-resolution images, enabling commanders to intuitively and accurately assess the fire scene. Currently, deep learning-based smoke and fire detection research can be broadly categorized into three types of approaches: image classification-based [3,6], image segmentation-based [5,7,8,9,10,11], and object detection-based [4,12,13,14].
Existing deep learning-based smoke and fire detection methods require a large and diverse set of training samples. However, smoke and fire image collection is challenging due to data scarcity and limited scene coverage. As shown in Figure 1, an image matting approach can readily generate a large quantity of smoke and fire images covering rich scenes [15]. Specifically, image matting can extract smoke and fire as foregrounds, which are blended with different backgrounds via alpha composition. Alpha composition refers to the process of blending foregrounds and backgrounds based on alpha values. The alpha value is a per-pixel transparency parameter stored in the alpha channel, which complements the three RGB color channels in each pixel [16]. Therefore, alpha composition enables the generation of realistic synthetic data. However, due to the lack of datasets specifically designed for training and evaluating deep learning-based smoke and fire image matting approaches, existing image matting methods often produce significant errors in predicting smoke and fire objects’ opacity values (alpha values). This inaccuracy impacts the realism of synthetic smoke and fire images, ultimately affecting the performance of smoke and fire detection models.
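Because alpha composition is central to this augmentation strategy, a minimal NumPy sketch of the per-pixel blend is given below; the function name and the random placeholder arrays are illustrative only and do not come from the paper.

```python
import numpy as np

def alpha_composite(foreground, background, alpha):
    """Blend a smoke/fire foreground onto a new background.

    foreground, background: float arrays of shape (H, W, 3) in [0, 1]
    alpha: float array of shape (H, W) in [0, 1], the per-pixel opacity
    """
    alpha = alpha[..., None]                       # broadcast over the RGB channels
    return alpha * foreground + (1.0 - alpha) * background

# Example: paste an extracted smoke foreground onto a new background scene
# (fg, bg, and alpha would come from a matting result and a target image)
fg = np.random.rand(256, 256, 3)
bg = np.random.rand(256, 256, 3)
alpha = np.random.rand(256, 256)
synthetic = alpha_composite(fg, bg, alpha)         # a new synthetic training sample
```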
To address the above issues, it is essential to develop a dedicated dataset tailored for smoke and fire matting tasks. Therefore, this paper introduces the smoke and fire image matting dataset, SFMatting-800 [17]. Based on this dataset, we evaluated existing image matting methods, revealing their advantages and limitations while providing a benchmark for further algorithm optimization. Building upon this benchmark, we propose an algorithm named SFMattingNet for Smoke and Fire image Matting. Considering smoke’s fluidity and transparency characteristics, the network incorporates a spatial awareness module and a non-rigid feature extraction module, enhancing its ability for smoke and fire perception and feature extraction. To summarize, the main contributions of this paper are as follows:
(1) A high-precision, multi-scene, multi-attribute, and fine-grained smoke and fire image matting dataset, SFMatting-800, was constructed, encompassing factories, rural areas, forests, grasslands, urban environments, etc. This dataset provides precise foreground alpha values and detailed attribute annotations of smoke and fire objects.
(2) Based on the SFMatting-800 dataset, the performance of existing matting methods was evaluated, providing a solid and reliable evaluation benchmark for better model design.
(3) A trimap-free image matting network, SFMattingNet, is proposed, which takes a single image as input without using the trimap. SFMattingNet achieved state-of-the-art performance on the smoke and fire image matting task. Ablation experiments on the SFMatting-800 dataset demonstrated the effectiveness of each module. Compared to other baselines, the proposed SFMattingNet achieves higher accuracy in the alpha value prediction of smoke and fire objects.

2. Related Work

2.1. Image Matting

Image matting refers to the technique of precisely extracting the soft labels of foreground objects from the given image while preserving the edge structure of the foreground as much as possible [18]. The image matting process can be mathematically described as follows:
$$I_i = \alpha_i F_i + (1 - \alpha_i) B_i, \quad \alpha_i \in [0, 1],$$
where $I$ represents the input image, $F$ represents the foreground image, and $B$ represents the background image. The color of each pixel is a linear combination of the corresponding foreground and background colors, where $\alpha$ is the foreground opacity value, which ranges from 0 to 1; $\alpha_i = 1$ indicates that the pixel is the absolute foreground; and $\alpha_i = 0$ indicates that the pixel is the absolute background. A value of $\alpha_i \in (0, 1)$ corresponds to a transition region. For a three-channel RGB image, Equation (1) evolves into a system of three equations with seven unknown variables. Therefore, additional constraints are needed to solve the system. The most commonly used constraint in image matting is the trimap, which divides the original image into three regions: the absolute foreground, the absolute background, and the transition region [19].
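In practice, a trimap is often derived from a known or roughly annotated alpha matte by eroding the confident foreground and background regions; the short OpenCV sketch below illustrates one such derivation. The kernel size and thresholds are illustrative choices, not values taken from any cited method.

```python
import cv2
import numpy as np

def make_trimap(alpha, kernel_size=15, fg_thresh=0.95, bg_thresh=0.05):
    """Derive a trimap (0 = background, 128 = unknown, 255 = foreground)
    from an alpha matte in [0, 1]; thresholds and kernel size are illustrative."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    fg = (alpha >= fg_thresh).astype(np.uint8)      # confident foreground pixels
    bg = (alpha <= bg_thresh).astype(np.uint8)      # confident background pixels
    fg = cv2.erode(fg, kernel)                      # shrink both regions to stay conservative
    bg = cv2.erode(bg, kernel)
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)
    trimap[fg == 1] = 255
    trimap[bg == 1] = 0
    return trimap

alpha = np.random.rand(256, 256)                    # placeholder alpha matte in [0, 1]
trimap = make_trimap(alpha)
```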

2.1.1. Traditional Matting Methods

  • Propagation-based Image Matting Methods
Propagation-based image matting methods are grounded in a key assumption: adjacent pixels have similar alpha values. In this approach, the alpha values are propagated from known regions to transition regions based on this similarity. For example, the Close-Form (CF) matting method [20] solves for the global optimal solution of the alpha channel by constructing a cost function that optimizes the influence of the foreground and background colors, aiming to eliminate the interference of colors on the alpha channel estimation. The K-Nearest-Neighbor (KNN) matting algorithm [21] applies the non-local principle, using nearest neighbors to match non-local regions and considering both local and non-local contextual information. The Random Walk image matting algorithm [22] constructs a graph representing the relationships between pixels and propagates alpha values from known points to unknown points via random walks. The Poisson matting algorithm [23] solves the Poisson equation using trimap information to obtain more accurate image matting results. Propagation-based image matting methods propagate alpha values among neighboring pixels, relying on the computation of pixel-wise similarity. Thus, they perform poorly when dealing with images that contain non-continuous target objects.
  • Sampling-based Image Matting Methods
Sampling-based image matting methods select an appropriate foreground–background pixel pair from the foreground and background regions marked by the trimap for each unknown pixel. The alpha value of a pixel in the transition region can then be calculated by comparing its color with the corresponding foreground–background pixel pair. For example, He et al. [24] utilized a global sampling algorithm, reducing the possibility of missing truly suitable foreground–background color samples. To narrow the sampling space, Wang et al. [25] adopted a local sampling strategy, sampling evenly along the edges of known foreground and background regions, evaluating the reliability of these sampling areas, and then selecting the most reliable samples for alpha value estimation. Shahrian et al. [26] clustered the foreground and background colors and then used the pixel mean of each cluster as candidate samples. Feng et al. [27] employed a sparse coding method to select multiple foreground–background pixel pairs for each pixel in the transition region. However, the accuracy of sampling-based image matting methods depends heavily on the precision of the selected foreground–background pixel pairs. When the image contains complex color distributions, or when colors of the unknown regions (either foreground or background) do not appear in the known regions, sampling-based image matting methods are prone to errors.
To summarize, traditional image matting methods primarily rely on low-level features (e.g., color and texture), which limits their performance. Moreover, when images contain non-continuous target objects, or the colors of the unknown regions do not appear in the known regions, the prediction results are often erroneous. To overcome these limitations, researchers have turned to deep learning techniques to continuously explore more advanced image matting methods.

2.1.2. Deep Learning-Based Image Matting Methods

Deep learning-based image matting methods can be roughly divided into two categories based on whether auxiliary input is required: methods with auxiliary input and trimap-free methods.
  • Methods with Auxiliary Input
Image matting is an inherently ill-posed problem because, given only an RGB image, the alpha matte, foreground, and background layers are all unknown. This underdetermined nature necessitates additional constraints or auxiliary inputs to resolve a feasible solution. Common constraints include trimaps [28], rough segmentation maps [29], background images [30], scribbles [31], user clicks [32], or textual descriptions [33]. These constraints are typically input alongside the original image into the image matting network to achieve more accurate prediction results.
The most common auxiliary input is the trimap, and such methods are called trimap-based methods. For example, the deep image matting (DIM) method [28] uses a simple single-branch network architecture. The original image and trimap are processed using a fully convolutional VGG-16 network [34] with skip connections to obtain a rough alpha matte, which is then refined through an optimization network to produce the final alpha matte, which is shown in Figure 2. AlphaGAN [35] and TransMatting [19] introduced Generative Adversarial Networks (GANs) [36] and Transformers [37] into a single-branch network for alpha value prediction. IndexNet [38] introduced an innovative module that generates appropriate downsampling and upsampling indices based on image features to improve boundary prediction accuracy. The Guided Contextual Attention (GCA) network [39] utilizes a global context attention mechanism to guide the propagation of high-level alpha values based on the learned low-level features. The aforementioned methods employ a relatively simple single-branch network. Multi-branch networks process different inputs through separate branches and generate the final output through feature fusion. Tang et al. [40] used a dual-branch network to predict the background and foreground colors of the transition region separately. Hou et al. [41] proposed a network that includes two encoding sub-networks. One sub-network performs small downsampling to capture local information, while the other performs large downsampling to capture global context information. Cai et al. [42] decomposed image matting into two sub-tasks, trimap adaptation and alpha value estimation. However, all auxiliary input-requiring methods are unsuitable for real-time applications due to the need for user interaction.
  • Trimap-Free Methods
To overcome the limitations of image matting methods that rely on auxiliary inputs and make image matting applicable in various industrial scenarios, researchers have proposed automatic matting methods that do not depend on any auxiliary inputs and solely predict alpha mattes from the original image. These approaches can be classified into two categories based on the network model architecture: single-branch and multi-branch.
In single-branch networks, the intermediate layers are typically used to generate auxiliary inputs, such as trimaps or rough segmentation maps, transforming the task into an image matting problem with auxiliary inputs. For example, Deep Automatic Portrait Matting (DAPM) [43] generates a trimap in the intermediate layers and then predicts the alpha values based on it. The Late Fusion Matting (LFM) algorithm uses a DenseNet-201 [44]-based encoder to obtain foreground and background probability maps, which are then input into a fusion network to predict a high-precision alpha matte [45]. In multi-branch networks, the image matting task is typically decomposed into two sub-tasks: semantic prediction and detail prediction. The Global and Fine-grained Matting (GFM) algorithm [46] uses a shared encoder and two independent decoders to collaboratively complete the high-level semantic segmentation and low-level detail matting of animals. MODNet [18] and PP-Matting [47] perform both semantic estimation and detail prediction, and they then use a fusion branch to complete the prediction task. AIM-Net uses a spatial attention module to guide the learning of detailed information in the transition region based on the semantic features learned from a semantic decoder [48]. Such methods require no user interaction, do not rely on auxiliary inputs, and are applicable to real-time scenarios, thereby making them more worthy of further investigation.

2.2. Image Matting Dataset

The deep learning-based image matting methods largely depend on high-quality datasets. Based on the source of the images in existing image matting datasets, they are categorized into synthetic image-based datasets and natural image-based datasets.
Manually annotating fine foreground alpha mattes is highly time-consuming, so many image matting datasets are primarily composed of synthetic images. These synthetic images are generated by extracting foreground elements from simple backgrounds and then combining them with complex backgrounds using alpha blending techniques [20]. For example, the Composition-1K dataset [28] expands the diversity of the AlphaMatting dataset [49]. It first manually annotates alpha mattes using Photoshop and then composites foregrounds onto new backgrounds. The Distinction-646 dataset [50] contains 596 training images and 50 test images, increasing the variety of categories in the dataset. The Transparent-460 dataset [19] is designed for matting tasks involving high-transparency objects, such as ice cubes and glass cups. Human-2K [51] is a high-resolution portrait matting dataset consisting of 2100 synthetic images. However, the compositing artifacts between the foreground and background in synthetic images can increase the discrepancy with natural images.
To enhance the generalization ability of image matting models for real-world scenarios, datasets based on natural images have emerged. The AlphaMatting dataset [49] contains only 27 training images and 8 test images. Limited diversity and scale restrict its usefulness for neural network training. The DAPM dataset [43] is a portrait matting dataset generated with accurate labels from the Close-Form Matting [20] and KNN Matting [21] methods. To protect the privacy of portrait information, Li et al. [52] proposed the first large-scale matting dataset aimed at protecting portrait privacy, the P3M-10K dataset. It contains 10,421 finely annotated images with privacy protection features. The AM-2K dataset [46] is used for animal image matting tasks, containing 20 categories and 2000 animal images.

2.3. Deep Learning-Based Smoke and Fire Detection Methods

Currently, deep learning-based smoke and fire detection methods can be broadly categorized into three approaches: image classification-based methods, image segmentation-based methods, and object detection-based methods. Image classification-based detection methods output a label for the input image, determining whether the image contains fire or smoke [3,6]. Although such methods can identify the presence of smoke or fire, they cannot determine its exact location, which poses certain limitations. Object detection-based detection methods simultaneously localize and classify smoke and fire, marking detected objects with bounding boxes [4,12,13,14]. These methods can be categorized into two-stage and one-stage approaches. Two-stage methods first generate region proposals and then perform refinement, offering high accuracy but slower speed [12]. One-stage methods directly perform classification and localization on the entire image, offering faster inference but generally lower accuracy [13]. Image segmentation-based detection methods classify each pixel to determine its category, providing information on the size and shape of the fire or smoke, thereby offering a more comprehensive capability for smoke and fire detection [5,7,8,9,10,11].

2.4. Smoke and Fire Dataset

To effectively leverage deep learning for smoke and fire detection, some researchers have proposed relevant smoke and fire detection datasets. Dunnings et al. [3] specifically designed a dataset for fire classification, containing both fire and non-fire images, with an imbalance between positive and negative samples. de Venancio et al. [4] introduced the DFire dataset with bounding box annotations, which is used for smoke and fire object detection. The DFS dataset [53] is a smoke and fire dataset, providing high-quality bounding box annotations. The SMOKE5K dataset [5] is used for smoke image segmentation tasks. This dataset is similar to the one proposed in this paper. However, it primarily consists of synthetic images rather than real-world images, and it does not provide alpha values for the smoke pixels. The BoWFire dataset [54] provides segmentation annotations for burning areas. In summary, the above datasets are used for smoke and fire image classification, segmentation, and detection tasks. However, there is currently no smoke and fire dataset designed specifically for image matting tasks. Thus, the smoke and fire image matting dataset SFMatting-800 is proposed in this paper, providing precise alpha value annotations.

3. Datasets Generation

To train and optimize the smoke and fire image matting model, a smoke and fire image matting dataset, SFMatting-800, is first proposed. This dataset not only features fine-grained foreground alpha value annotations, but it also includes rich attribute annotations. Based on this dataset, existing traditional image matting methods, as well as deep learning-based trimap-based and trimap-free image matting methods, are evaluated. The strengths and weaknesses of these methods are analyzed and compared, thereby providing a benchmark for the subsequent construction and improvement of smoke and fire matting models.
SFMatting-800 was constructed through multiple stages, including data collection, data filtering, data annotation, and data quality control. The flowchart for the construction of the SFMatting-800 dataset is illustrated in Figure 3. The following sections provide detailed descriptions of the workflows for data collection, data filtering, data annotation, and data quality control.

3.1. Data Collection

Initially, targeted smoke and fire images should be collected from the internet across diverse scenarios to ensure the model generalizes well under varying conditions. As shown in Figure 3, automated web crawling scripts were developed using Python 3.6 to enhance retrieval efficiency. These scripts are capable of automatically scraping relevant image data from search engines such as Baidu, Google, and Bing based on specified keywords. Keywords used for retrieval include forest fires, grassland fires, urban fires, rural cooking smoke, industrial chimney smoke, straw burning, and plain fires, among others. Approximately 50,000 images related to smoke and fire were collected using these keywords, covering a wide range of outdoor scenarios.

3.2. Data Filtering

Although a substantial number of images were acquired from the internet, their quality varied significantly, with inconsistencies in size and relevance to the theme of this paper. To facilitate the data annotation process, it was necessary to select high-quality images from the vast collection of acquired data. The core principle of data filtering was to ensure that the images contained prominent smoke, fire, or smoke-and-fire targets. In particular, priority was given to retaining aerial images of smoke and fire, which are critical for applications such as forest fire prevention. To this end, the following manual screening mechanism was established: the shortest side of the image must exceed 100 pixels, and the foreground target must be conspicuous. Through this screening process, 1500 high-quality images were ultimately obtained. These images encompass a variety of outdoor application scenarios, ensuring the generalization capability of the subsequent model.

3.3. Data Annotation

After careful data filtering, it is necessary to annotate the collected images, obtaining the location of the smoke and fire targets and the alpha values of the pixels where the smoke and fire are located in each image.
Directly using manual annotation to determine the alpha values of targets is a highly challenging task as it is both time-consuming and labor-intensive. To save resources, a computer-assisted semi-automatic method was employed in this paper for annotation, corresponding to Step 1 in Figure 3. Initially, each original image is divided into three parts: foreground, background, and transition regions. The annotation result from this process is referred to as a trimap. Three annotators independently annotate the same image, resulting in three corresponding trimaps for each image. Subsequently, the original images and their respective trimaps are input into the PyMatting library (PyMatting is available at https://github.com/pymatting/pymatting, accessed on 16 February 2025), which supports five traditional machine learning image matting methods, including Close-Form (CF) Matting [20], K-Nearest Neighbor (KNN) Matting [21], Learning Based Digital Matting (LBDM) [55], Large Kernel Matting (LKM) [56], and Random Walk (RW) Matting [22]. Using the three manually annotated trimaps as input, a total of 15 corresponding original alpha mattes were generated for each image, corresponding to Step 2 in Figure 3. Pre-trained deep learning methods were not used to generate alpha mattes because existing deep learning methods exhibit poor generalization capabilities in smoke and fire tasks. Finally, the attributes of each image pair were annotated, including the category, the color of smoke, and the presence of fire. Depending on the fire scenario, the categories included industrial chimney smoke, rural cooking smoke, forest fires, grassland fires, urban fires, straw burning, etc. The colors of smoke included white, gray, black, brown, light yellow, etc. The original images are stored in JPG format, the annotated alpha mattes are in BMP format, and the attribute annotations are in CSV files.
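As an illustration of this semi-automatic step, the snippet below applies the five PyMatting solvers to one image and its three annotator trimaps, yielding the 15 candidate alpha mattes. The file names are placeholders, and the calls follow PyMatting's documented estimate_alpha_* interface; this is a sketch, not the authors' actual annotation script.

```python
from pymatting import (load_image, save_image,
                       estimate_alpha_cf, estimate_alpha_knn,
                       estimate_alpha_lbdm, estimate_alpha_lkm,
                       estimate_alpha_rw)

# The five traditional solvers used to turn each annotator's trimap
# into a candidate alpha matte.
solvers = {
    "cf": estimate_alpha_cf,
    "knn": estimate_alpha_knn,
    "lbdm": estimate_alpha_lbdm,
    "lkm": estimate_alpha_lkm,
    "rw": estimate_alpha_rw,
}

image = load_image("fire_0001.jpg", "RGB")           # H x W x 3, values in [0, 1]
for annotator in ("a1", "a2", "a3"):                 # three independent trimaps
    trimap = load_image(f"fire_0001_trimap_{annotator}.png", "GRAY")
    for name, solver in solvers.items():             # 3 trimaps x 5 methods = 15 mattes
        alpha = solver(image, trimap)
        save_image(f"fire_0001_alpha_{annotator}_{name}.png", alpha)
```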

3.4. Data Quality Control

After the above data annotation, data quality assessment was conducted through a manual scoring mechanism and iterative processes, rigorously monitoring and enhancing the quality of the dataset. The quality evaluation criteria included whether the generated alpha mattes were complete (without obvious over- or under-segmentation), whether they exhibited visual consistency across different regions, and whether they demonstrated algorithmic consistency with similar images.
The specific data quality evaluation process is as follows: Firstly, the initial quality assessment is conducted after generating 15 original alpha mattes using the PyMatting library, corresponding to Step 3 in Figure 3. Five evaluators were invited to assess these 15 original alpha mattes according to the aforementioned criteria. If a candidate alpha matte met the criteria, it scored 1 point; otherwise, it scored 0. The scores from different evaluators were accumulated, resulting in a final score between 0 and 5 for each original alpha matte. The two candidate alpha mattes with the highest scores were selected. Secondly, these two candidate alpha mattes were composited with three different backgrounds for alpha composition, corresponding to Step 4 in Figure 3. Five evaluators were invited to assess the six composite results using the same scoring criteria. If a candidate alpha matte met the criteria in two or three backgrounds, it scored 1 point. Finally, each candidate alpha matte’s score ranges between 0 and 5, and the candidate alpha mattes were classified based on their scores, corresponding to Step 5 in Figure 3. Candidate alpha mattes with a score of 5 were considered the final alpha mattes; those with a score of 4 were manually modified using image processing software and then used as the final alpha mattes; those with a score of 3 underwent a new round of evaluation by generating new trimaps through dilation and erosion operations [57], which were then input into the open-source PyMatting library along with the original images (with this iteration only being performed once); and candidate alpha mattes with other scores are discarded.
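The score-based decision rule described above can be summarized as a small helper; the function below is purely illustrative and is not part of the actual annotation toolchain.

```python
def quality_action(score):
    """Map a candidate alpha matte's evaluator score (0-5) to the action
    described in the text; this helper is purely illustrative."""
    if score == 5:
        return "accept as final alpha matte"
    if score == 4:
        return "manually refine, then accept"
    if score == 3:
        return "regenerate trimap via dilation/erosion and re-evaluate once"
    return "discard candidate"
```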
Up to this point, through the processes of data collection, data filtering, data annotation, and data quality assessment, the first smoke and fire image matting dataset, SFMatting-800, was constructed. This dataset comprises a total of 800 images annotated with high-precision opacity labels. Examples of different categories and attributes within the SFMatting-800 dataset are illustrated in Figure 4. To ensure reproducibility and support further studies, the dataset is accessible online for review (the SFMatting-800 dataset is available at https://github.com/buguaigiser/SFMatting-800-Dataset, accessed on 27 May 2025).

4. Methods

4.1. Framework of SFMattingNet

Due to smoke and fire transparency and non-rigid characteristics, a trimap-free smoke and fire image matting network, SFMattingNet, is proposed. The architecture is shown in Figure 5. The proposed algorithm is built upon the portrait matting network MODNet [18] and aims to efficiently and automatically predict the alpha values of the foreground smoke and fire pixels. The network adopts a three-branch architecture: a low-resolution semantic estimation branch, a high-resolution detail estimation branch, and a semantic-detail fusion branch. To enhance the network’s perception of spatial dependencies among pixels, especially for the pixels of smoke and fire, a spatial awareness module was incorporated into each branch. Additionally, to improve the network’s ability to extract features of non-rigid objects, a non-rigid object feature extraction module was introduced in the high-resolution detail estimation branch. A detailed description of these proposed modules is shown below.

4.1.1. Low-Resolution Semantic Estimation Branch

The low-resolution semantic estimation branch captures multi-scale semantic information of the foreground object and predicts the overall contour of the foreground object. SFMattingNet employs MobileNetV2 [58] as the backbone network and, on this basis, introduces a lightweight residual block based on separable convolutions, which expands the width and depth of the network. While maintaining a lightweight network, this approach enhances the model’s feature extraction and representation capabilities, allowing it to handle background and detail information better in the image, thus improving the accuracy and robustness of the matting process. Additionally, to enhance the model’s ability to perceive pixel-space dependencies, a spatial awareness module was added to this branch to collect contextual information from all pixels. This branch performs feature extraction and dimensionality reduction through stacked convolutional blocks, improving the model’s generalization ability. The convolutional block consists of a convolutional layer, batch normalization layer, and activation function.

4.1.2. High-Resolution Detail Estimation Branch

The high-resolution detail estimation branch focuses on the edge transition regions of the foreground object in the input image. Its inputs are the 2× downsampled original image, the intermediate features of the semantic branch, and the final output of the semantic branch, while its output is a foreground image with refined edges. To improve the model’s ability to extract features from non-rigid objects, the network employs a non-rigid object feature extraction module. By introducing learnable offsets to modify the sampling positions of convolutional kernels on the input feature map, this approach enhances the network’s adaptability to object deformations and improves its feature extraction capability. In addition, this branch incorporates a spatial awareness module, which effectively aggregates contextual semantic information.

4.1.3. Semantic-Detail Fusion Branch

The semantic-detail fusion branch upsamples the intermediate features generated by the detailed branch and the semantic branch to the same size, and it then concatenates them to form the fusion result. This fusion result is restored to the size of the input image as the final predicted alpha matte through convolution operations.
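To make the three-branch layout concrete, the toy PyTorch sketch below wires a semantic branch, a detail branch, and a fusion branch together. It is only a structural illustration: the real SFMattingNet uses a MobileNetV2-based encoder, spatial awareness modules, and deformable convolutions where this sketch uses plain Conv-BN-ReLU blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out, stride=1):
    """Conv -> BatchNorm -> ReLU, the basic block mentioned in Section 4.1.1."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, 1, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class ThreeBranchMattingSketch(nn.Module):
    """Toy three-branch layout (semantic / detail / fusion) in the spirit of
    MODNet and SFMattingNet; all real sub-modules are replaced by stand-ins."""

    def __init__(self):
        super().__init__()
        # Low-resolution semantic branch (stand-in for the MobileNetV2 encoder)
        self.encoder = nn.Sequential(conv_block(3, 16, 2), conv_block(16, 32, 2),
                                     conv_block(32, 64, 2))
        self.semantic_head = nn.Conv2d(64, 1, 1)
        # High-resolution detail branch operating on a 2x downsampled image
        self.detail = nn.Sequential(conv_block(3 + 1, 16), conv_block(16, 16))
        self.detail_head = nn.Conv2d(16, 1, 1)
        # Semantic-detail fusion branch
        self.fusion = nn.Sequential(conv_block(2, 16), nn.Conv2d(16, 1, 1))

    def forward(self, image):
        h, w = image.shape[-2:]
        # Coarse foreground contour from low-resolution features
        semantic = torch.sigmoid(self.semantic_head(self.encoder(image)))
        # Detail prediction on the half-resolution image guided by the semantics
        half = F.interpolate(image, scale_factor=0.5, mode="bilinear", align_corners=False)
        sem_half = F.interpolate(semantic, size=half.shape[-2:], mode="bilinear", align_corners=False)
        detail = torch.sigmoid(self.detail_head(self.detail(torch.cat([half, sem_half], 1))))
        # Upsample both predictions to the input size and fuse them into the alpha matte
        sem_full = F.interpolate(semantic, size=(h, w), mode="bilinear", align_corners=False)
        det_full = F.interpolate(detail, size=(h, w), mode="bilinear", align_corners=False)
        alpha = torch.sigmoid(self.fusion(torch.cat([sem_full, det_full], 1)))
        return semantic, detail, alpha

pred_s, pred_d, pred_alpha = ThreeBranchMattingSketch()(torch.rand(1, 3, 256, 256))
```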

4.2. Non-Rigid Object Feature Extraction Module

The non-rigid object feature extraction module is proposed to effectively capture the features of non-rigid objects, such as smoke and fire. It can dynamically adjust sampling positions based on local features, overcome the limitations of regular grids, and enhance the flexibility of contextual feature extraction.
As shown in Figure 6a, the standard convolution operation performs convolution on a fixed receptive field $\mathcal{R}$ (e.g., a 3 × 3 grid), where each output feature $y(\mathbf{p}_0)$ at spatial position $\mathbf{p}_0$ is computed as a weighted sum of input features $x(\mathbf{p}_0 + \mathbf{p}_n)$ sampled at rigid offsets $\mathbf{p}_n \in \mathcal{R}$, with learnable weights $w(\mathbf{p}_n)$. The standard convolution operation takes the following form:
$$y(\mathbf{p}_0) = \sum_{\mathbf{p}_n \in \mathcal{R}} w(\mathbf{p}_n) \cdot x(\mathbf{p}_0 + \mathbf{p}_n).$$
Unlike traditional rigid objects, such as cups and flowers, smoke and fire are non-rigid objects characterized by strong diffusion, large-scale variations, and diverse shapes. For such non-rigid objects, a fixed receptive field struggles to capture generalized and discriminative features. Consequently, standard convolution faces certain limitations when dealing with objects of varying sizes and shapes.
Compared to standard convolution, deformable convolution introduces learnable offsets $\Delta\mathbf{p}_n$, allowing the sampling points of the convolution kernel to shift dynamically on the input feature map [59], which is shown in Figure 6b. The deformable convolution operation takes the following formula:
$$y(\mathbf{p}_0) = \sum_{\mathbf{p}_n \in \mathcal{R}} w(\mathbf{p}_n) \cdot x(\mathbf{p}_0 + \mathbf{p}_n + \Delta\mathbf{p}_n),$$
where $\Delta\mathbf{p}_n$ is dynamically predicted by an auxiliary convolutional layer applied to the input, enabling adaptive spatial sampling beyond the fixed grid $\mathcal{R}$. The input feature $x$ at fractional locations $(\mathbf{p}_0 + \mathbf{p}_n + \Delta\mathbf{p}_n)$ is obtained via bilinear interpolation, allowing sub-pixel precision. This flexibility makes deformable convolution superior for modeling geometric transformations (e.g., scaling and rotation) in tasks like object detection, where $\Delta\mathbf{p}_n$ effectively learns to "deform" the sampling grid around object boundaries or occluded regions. Since the sampling positions are generated by a learnable convolution kernel, deformable convolution can adaptively enlarge the receptive field based on specific feature content, thereby obtaining a more informative output feature map.
Therefore, our SFMattingNet network introduces a non-rigid object feature extraction module that is centered around the deformable convolution algorithm. This module replaces the fixed 3 × 3 convolutions in the high-resolution detail branch with deformable convolutions to enable adaptive receptive fields. By dynamically adjusting sampling positions based on local features, it overcomes the limitations of regular grids and enhances the flexibility of contextual feature extraction. This is particularly effective for modeling smoke and fire, which exhibit irregular shapes and diffuse boundaries, thereby improving feature representation and downstream task performance.
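A minimal sketch of such a block, built on torchvision's DeformConv2d, is shown below; the channel count, kernel size, and zero-initialized offset predictor are illustrative choices rather than the exact configuration used in SFMattingNet.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class NonRigidFeatureBlock(nn.Module):
    """Illustrative deformable-convolution block: a small conv predicts the
    per-position sampling offsets, and DeformConv2d samples the input at the
    shifted (sub-pixel) locations via bilinear interpolation."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # 2 offsets (dx, dy) per kernel element, predicted from the input itself
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=kernel_size // 2)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2)
        nn.init.zeros_(self.offset_pred.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_pred.bias)

    def forward(self, x):
        offsets = self.offset_pred(x)              # (N, 2*k*k, H, W)
        return self.deform_conv(x, offsets)

features = torch.rand(1, 16, 64, 64)
out = NonRigidFeatureBlock(16)(features)           # same spatial size as the input
```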

4.3. Spatial Awareness Module

In smoke and fire matting tasks, there are many objects with similar visual characteristics, such as clouds and snow, which are easily misidentified as smoke or fire. Therefore, it is essential to capture global contextual information to achieve accurate smoke and fire matting; thus, the spatial awareness module is proposed.
Specifically, to perceive the spatial correlation between pixels and to obtain pixel-wise contextual information, the non-local block [60] utilizes a dense self-attention mechanism, allowing any position to perceive features from all other positions and thus obtain global context information. As shown in Figure 7a, given a feature map $F \in \mathbb{R}^{H \times W \times C}$, the dense self-attention mechanism first generates three feature maps $Q$, $K$, and $V$, where $\{Q, K, V\} \in \mathbb{R}^{H \times W \times C}$. For a pixel $u$ marked in blue, its feature update process can be formulated as follows:
$$F_u = \mathrm{softmax}\!\left(\frac{Q_u K^{T}}{\sqrt{C}}\right) V,$$
where $Q_u \in \mathbb{R}^{C}$ is the query vector of the pixel $u$, and the attention map is calculated via $\mathrm{softmax}(Q_u K^{T}/\sqrt{C}) \in \mathbb{R}^{H \times W}$, which is marked in green in Figure 7a. The updated feature $F_u$ of the pixel $u$ remains within the space $\mathbb{R}^{C}$. However, this method relies on a large-scale attention map within the space $\mathbb{R}^{H \times W}$ to measure relationships between the pixel $u$ and all other pixels, which results in high computational complexity.
To address these issues, the criss-cross attention block [61] replaces the single-layer dense attention in the non-local block with consecutive sparse attention. The criss-cross attention block performs $(H + W - 1)$ sparse connections at each position in the feature map, aggregating the contextual information of all pixels along the cross path of each pixel. As shown in Figure 7b, for the pixel $u$ marked in blue, the attention map is calculated with a set of key vectors $\tilde{K} = \{K_i\}$, where $i \in \{1, 2, \ldots, H + W - 1\}$ and $\tilde{K} \in \mathbb{R}^{(H + W - 1) \times C}$. These key vectors $\{K_i\}$ are obtained along the cross path of the pixel $u$, and each $K_i$ remains within the space $\mathbb{R}^{C}$. Thus, the feature update process of the pixel $u$ in the criss-cross attention algorithm takes the following formula:
$$F_u = \mathrm{softmax}\!\left(\frac{Q_u \tilde{K}^{T}}{\sqrt{C}}\right) \tilde{V},$$
where $\tilde{V} \in \mathbb{R}^{(H + W - 1) \times C}$ is the value feature obtained along the cross path of the pixel $u$, and $F_u \in \mathbb{R}^{C}$ is the updated feature of the pixel $u$. The attention map is generated with $\mathrm{softmax}(Q_u \tilde{K}^{T}/\sqrt{C}) \in \mathbb{R}^{H + W - 1}$, and thus its computation is greatly reduced compared with Equation (4). By adopting further recursive operations, the relationship between the target pixel and all other pixels in the feature map is established, and, based on this relationship, the features of the target pixel are weighted and aggregated. This algorithm significantly reduces computational complexity while effectively capturing target features. Therefore, the spatial awareness module, consisting of a criss-cross attention block, is established. With relatively low computational overhead, this module can effectively capture long-range contextual dependencies across the entire image while minimizing false detection in feature-similar regions, thereby further enhancing the accuracy of trimap-free image matting methods.
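The sketch below implements a simplified criss-cross attention in PyTorch, in which every pixel attends only to the pixels in its own row and column (the centre pixel is counted twice here for brevity). The published module additionally applies a learnable output scale and is used recurrently; the channel reduction factor is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttentionSketch(nn.Module):
    """Simplified criss-cross attention: each pixel aggregates context only
    from its own row and column instead of all H * W positions."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 8, 1)
        self.query = nn.Conv2d(channels, reduced, 1)
        self.key = nn.Conv2d(channels, reduced, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Row attention: each pixel vs. all pixels in the same row (length W)
        attn_row = torch.einsum("nchw,nchv->nhwv", q, k)      # (N, H, W, W)
        # Column attention: each pixel vs. all pixels in the same column (length H)
        attn_col = torch.einsum("nchw,ncuw->nhwu", q, k)      # (N, H, W, H)

        attn = F.softmax(torch.cat([attn_row, attn_col], dim=-1)
                         / q.shape[1] ** 0.5, dim=-1)          # (N, H, W, W+H)
        attn_row, attn_col = attn[..., :w], attn[..., w:]

        out_row = torch.einsum("nhwv,nchv->nchw", attn_row, v)
        out_col = torch.einsum("nhwu,ncuw->nchw", attn_col, v)
        return out_row + out_col + x                           # residual connection

context = CrissCrossAttentionSketch(16)(torch.rand(1, 16, 32, 32))
```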

4.4. Loss Function

4.4.1. The Loss Function for Low-Resolution Semantic Estimation Branch

The loss function for a low-resolution semantic estimation branch can be expressed as follows:
$$L_s = \frac{1}{2} \left\lVert s_p - G(\alpha_g) \right\rVert_2,$$
where $G(\cdot)$ represents the Gaussian blur operation followed by a 16× downsampling operation, $s_p$ denotes the prediction results of the semantic branch, and $\alpha_g$ represents the ground truth labels.

4.4.2. The Loss Function for High-Resolution Detail Estimation Branch

The loss function for a high-resolution detail estimation branch can be expressed as follows:
$$L_d = m_d \left\lVert d_p - \alpha_g \right\rVert_1,$$
where $d_p$ represents the prediction results of the detail branch; $\alpha_g$ represents the ground truth labels; $m_d$ is the binary mask of the image’s transition region, computed via dilation and erosion of the ground truth labels, with pixels within the transition region assigned a value of 1, and 0 otherwise; and $\lVert \cdot \rVert_1$ is the L1 norm.

4.4.3. The Loss Function for Semantic-Detail Fusion Branch

The loss function for the semantic-detail fusion branch can be expressed as follows:
$$L_\alpha = \left\lVert \alpha_p - \alpha_g \right\rVert_1 + L_c,$$
where $\alpha_p$ denotes the network’s predicted results, and $L_c$ represents the compositional loss [20], which computes the absolute difference between the original image and the composited image generated using $\alpha_p$, along with the ground truth foreground and background.
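For reference, the three branch losses can be written compactly as follows; this is a hedged sketch in which the Gaussian blur of Equation (6) is approximated by average pooling and the tensor shapes are assumed, so it should not be read as the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def sfmatting_losses(s_p, d_p, alpha_p, alpha_g, m_d, image, fg, bg):
    """Illustrative version of Equations (6)-(8).
    s_p is assumed to be at 1/16 of the input resolution; d_p, alpha_p,
    alpha_g, m_d are (N, 1, H, W); image, fg, bg are (N, 3, H, W)."""
    # Semantic loss: L2 distance to a blurred, 16x-downsampled ground truth
    g_alpha = F.avg_pool2d(alpha_g, kernel_size=16)        # stand-in for G(alpha_g)
    loss_s = 0.5 * F.mse_loss(s_p, g_alpha)

    # Detail loss: L1 distance restricted to the transition-region mask m_d
    loss_d = (m_d * (d_p - alpha_g).abs()).sum() / m_d.sum().clamp(min=1)

    # Fusion loss: L1 on the alpha matte plus the compositional loss
    composite = alpha_p * fg + (1 - alpha_p) * bg
    loss_alpha = F.l1_loss(alpha_p, alpha_g) + F.l1_loss(composite, image)
    return loss_s, loss_d, loss_alpha
```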

5. Experiments

5.1. Evaluation Metrics

To comprehensively and accurately quantify the prediction accuracy of image matting methods, four evaluation metrics were employed: the Sum of Absolute Differences (SAD), Mean Square Error (MSE), Gradient Error (Grad), and Connectivity Error (Conn). However, traditional image matting methods and trimap-based image matting methods rely on trimaps as supervisory information, causing their errors to concentrate in the transition regions. In contrast, trimap-free deep image matting methods lack trimap supervision and must predict across the entire image, so their errors may be distributed throughout the image. To more accurately assess the performance of different algorithms in various regions, the metrics SAD-ALL, MSE-ALL, Grad-ALL, and Conn-ALL were introduced. These metrics calculate the errors over the entire image, with n representing the total number of pixels in the image. The metrics SAD, MSE, Grad, and Conn focus primarily on errors in the transition regions, where n represents the number of pixels in the transition area. In the following text, each of these evaluation metrics is introduced in detail.
SAD is the sum of the absolute differences between the predicted alpha values of each pixel in the alpha matte and the corresponding ground truth alpha values. The function can be expressed as follows:
$$\mathrm{SAD} = \sum_{i=1}^{n} \left| \alpha_i - \alpha_i^{*} \right|,$$
where $\alpha_i$ and $\alpha_i^{*}$ represent the predicted alpha value and the ground truth alpha value at the pixel $i$, respectively, and $n$ is the number of pixels in the transition region or the entire image.
MSE calculates the mean of the squared differences between the predicted alpha values and the corresponding ground truth alpha values for each pixel in the alpha matte. The function can be expressed as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( \alpha_i - \alpha_i^{*} \right)^2.$$
Grad is used to evaluate the over-smoothing of the predicted alpha matte. It is defined as the sum of the normalized gradient differences between the predicted alpha values and the corresponding ground truth alpha values for each pixel. Specifically, the gradient value of each pixel is obtained by convolving the alpha values with a first-order Gaussian derivative filter. The function can be expressed as follows:
$$\mathrm{Grad} = \sum_{i=1}^{n} \left( \nabla \alpha_i - \nabla \alpha_i^{*} \right)^2,$$
where ∇ is the normalized gradient operator.
Conn measures the connectivity of the predicted alpha matte, which is determined by the connectivity of the binary image obtained through thresholding the alpha matte. The function can be expressed as follows:
$$\mathrm{Conn} = \sum_{i=1}^{n} \left( \varphi(\alpha_i, \Omega) - \varphi(\alpha_i^{*}, \Omega) \right)^2,$$
where the function $\varphi(\cdot)$ calculates the connectivity of pixel $i$ with respect to the largest connected component $\Omega$, which is the region where both the predicted alpha matte and the ground truth alpha values are equal to 1.
SAD and MSE reflect the numerical differences between the predicted alpha matte and the ground truth labels. Grad captures the errors at the foreground edges in the predicted alpha matte. Conn measures the difference in connectivity between the predicted alpha matte and the ground truth labels. The latter two metrics correspond to subjective visual evaluations of the prediction quality based on human perception.
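The region-restricted and whole-image variants of these metrics differ only in the set of pixels summed over, as the short NumPy sketch below illustrates for SAD and MSE; the arrays and the transition mask are placeholders.

```python
import numpy as np

def sad(alpha_pred, alpha_gt, mask=None):
    """Sum of absolute differences, optionally restricted to a region mask
    (e.g., the trimap's transition region); purely illustrative."""
    diff = np.abs(alpha_pred - alpha_gt)
    return diff[mask].sum() if mask is not None else diff.sum()

def mse(alpha_pred, alpha_gt, mask=None):
    """Mean squared error over the selected region."""
    diff = (alpha_pred - alpha_gt) ** 2
    return diff[mask].mean() if mask is not None else diff.mean()

# SAD/MSE over the transition region vs. the whole image (the -ALL variants)
pred, gt = np.random.rand(256, 256), np.random.rand(256, 256)
transition = np.random.rand(256, 256) > 0.8          # placeholder boolean mask
print(sad(pred, gt, transition), sad(pred, gt))      # SAD vs. SAD-ALL
```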

5.2. Overall Experiments

In order to establish benchmarks for image matting on our proposed SFMatting-800 dataset, several image matting methods were tested, including traditional image matting methods (CF Matting [20], KNN Matting [21], LKM [56], LBDM [55], and RW Matting [22]), trimap-based deep image matting methods (DIM [28], Guided Contextual Attention (GCA) Matting [39], and IndexNet [38]), and trimap-free deep image matting methods (MODNet [18], PP-Matting [47], Global and Fine-grained Matting (GFM) [46], and also our proposed SFMattingNet).

5.2.1. Traditional Image Matting Methods

The quantitative experimental results of the traditional image matting methods are shown in Table 1, where smaller metric values indicate higher prediction accuracy. When SAD, MSE, and Conn were used as key evaluation metrics, KNN Matting performed the best among all traditional methods, suggesting that the prediction results of KNN Matting are highly similar to the ground truth labels. When Grad was used as the key evaluation metric, the LKM algorithm outperformed the other methods, indicating that the LKM algorithm has an advantage in predicting edge details and preserving delicate texture information. In KNN Matting, the alpha value of each pixel is obtained by performing a weighted average of its nearest neighboring alpha values. This algorithm leads to overly smooth prediction results and the loss of detailed information. As shown in Figure 8, KNN Matting accurately predicts in relatively smooth textured regions but performs poorly at the edges. The LKM algorithm, on the other hand, uses Gaussian weights for linear combinations, effectively preserving edge texture information. The prediction results of RW Matting exhibit a halo effect, which occurs because the algorithm uses an averaging method to redistribute pixel values, leading to excessive smoothing and to a loss of texture details.

5.2.2. Trimap-Based Deep Image Matting Methods

The quantitative experimental results of the trimap-based deep image matting methods are shown in Table 1. Compared to other methods, DIM outperformed the others in SAD, Conn, and Grad metrics, indicating that DIM produced the highest quality prediction results. However, despite DIM having the smallest SAD value, its MSE value was the largest. This phenomenon occurred because the errors were mainly concentrated in pixels that deviated significantly from the ground truth, and the squaring operation, compared to the absolute value operation, tended to amplify the errors. Furthermore, a visual comparison of the trimap-based methods on the SFMatting-800 test set is also provided, which is shown in Figure 9. By observing the visualization results in Columns 4 to 6 of Figure 9, it is evident that IndexNet and GCA Matting produced unnatural lines due to sudden changes in foreground alpha values. This issue arose because both methods only predicted the transition region in the trimap and directly included the absolute foreground and background regions as part of the result. In contrast, the matting refinement module in DIM can avoid this issue as DIM directly learns fine-grained details from the original image, even handling cases where there is no absolute foreground in the trimap.

5.2.3. Trimap-Free Deep Image Matting Methods

The quantitative experimental results of the trimap-free deep image matting methods are shown in Table 2. Since these methods are no longer constrained by trimaps, the trimap-free deep image matting methods offered greater flexibility and generalizability. Compared with all other trimap-free baseline methods, MODNet achieved smaller errors on most metrics, indicating relatively good smoke and fire matting performance. However, MODNet occasionally misidentifies visually similar objects as the target, as shown in the fourth column of Figure 10, where it incorrectly identified clouds as smoke. In the fifth column of Figure 10, the visualization results of GFM Matting show that the algorithm amplified the alpha values in the foreground region. The visualization results of PP-Matting reveal unnatural transitions in the foreground segmentation, which can be attributed to the simple overlay of semantic and detail maps during the fusion process. According to the quantitative experimental results in Table 2, except for MSE, the other three evaluation metrics increased as the computation area expanded, leading to larger error values. The MSE value for the entire image was smaller than that for the transition region because the MSE is obtained by averaging, and the non-transition regions, which were predicted more accurately, occupy a larger area, leading to a smaller overall error after averaging.
The performance of our SFMattingNet approach compared with the other trimap-free baselines is also shown in Table 2. SFMattingNet achieved the lowest error levels on almost all metrics, with an average error reduction of 12.65% compared to the second-best approach, MODNet. The visualization results are shown in Figure 10. As shown in the seventh column of Figure 10, the SFMattingNet method effectively handled challenging boundaries and texture details, while also providing reliable technical support for more refined image matting and compositing tasks in real-world applications. In particular, by comparing the fourth and seventh images in the second row of Figure 10, MODNet incorrectly identified clouds in non-transition regions as smoke, whereas SFMattingNet effectively avoided this issue due to its spatial awareness module. In summary, with the integration of a spatial awareness module and a non-rigid object feature extraction module, SFMattingNet achieved state-of-the-art performance.

5.3. Ablation Experiments

5.3.1. Quantitative Analysis

To thoroughly validate the effectiveness of each module in the SFMattingNet approach, several ablation experiments were conducted. More specifically, we manually removed the spatial awareness module, the non-rigid object feature extraction module, or both to assess their contributions. The corresponding experimental results are shown in Table 3. Compared with the baseline (no modules), although the spatial awareness module alone brought no significant error decrease in the four metrics SAD, MSE, Grad, and Conn, the metrics SAD-ALL (43.881 vs. 36.627), MSE-ALL (0.013 vs. 0.011), Grad-ALL (3.697 vs. 3.513), and Conn-ALL (42.704 vs. 35.707) all showed a significant decline, achieving an average error reduction of 11.30%. These results indicate that, while the accuracy within the transition region did not improve notably, there was a significant improvement in accuracy across the entire image. This suggests that the spatial awareness module enhances prediction accuracy in non-transition regions by globally capturing the long-range contextual dependencies across the entire image. Furthermore, under the exclusive effect of the non-rigid object feature extraction module, the four metrics SAD, MSE, Grad, and Conn exhibited a noticeable decrease, with an average decline of 6.29%, indicating improved accuracy within the transition region. This indicates that the non-rigid object feature extraction module can adapt to the highly variable shapes of non-rigid objects by dynamically adjusting the receptive field, thereby improving the matting accuracy in the transition region. Moreover, when both modules worked together, they complemented each other, and the five evaluation metrics SAD, Conn, SAD-ALL, MSE-ALL, and Conn-ALL all reached the lowest levels. In addition, SFMattingNet achieved comparable performance on the metrics of MSE, Grad, and Grad-ALL. In summary, with the help of these two modules, our SFMattingNet method demonstrated optimal performance and achieved the highest level of prediction accuracy.

5.3.2. Visual Analysis

From the previous analysis, the errors of trimap-free deep image matting methods mainly occurred in the non-transition regions, i.e., the black and white areas of the trimap. As shown in the third column of Figure 11, when the background and object were similar, MODNet demonstrated suboptimal performance in the non-transition regions, such as mistaking clouds and snow in the background for smoke, as indicated by the red boxes in Figure 11. As a result, our proposed spatial awareness module was applied to address this issue by capturing long-range contextual dependencies across the entire image. This enables the module to better distinguish between visually similar foreground and background elements. Through a thorough analysis of the visualization results in Figure 11, it is clearly evident that, with the spatial awareness module, the recognition accuracy in the non-transition regions was significantly improved, enhancing the performance of SFMattingNet. In addition, as indicated by the quantitative analysis in Table 3, the non-rigid feature extraction module significantly improved the accuracy in the transition region. However, since the transition region occupied only a tiny portion of the entire image, the visual effect was not prominent. Therefore, this paper does not present the visualization results of this module’s standalone effect.

6. Conclusions

In this paper, we propose a smoke and fire image matting dataset, SFMatting-800, and a trimap-free deep image matting method, SFMattingNet. We benchmarked the performance of existing representative methods through extensive experiments on the SFMatting-800 dataset, which provides significant support for better model design. With the non-rigid object feature extraction module and the spatial awareness module, our proposed SFMattingNet achieved state-of-the-art performance on the smoke and fire image matting task, with an average error reduction of 12.65% compared to the second-best approach, MODNet. In future research, we aim to combine deep learning with sensor technologies, using image data along with parameters such as temperature, light intensity, and gas concentration to enhance the understanding of smoke and fire characteristics, thereby improving smoke and fire detection accuracy and robustness.

Author Contributions

Conceptualization, S.M. and H.Y.; methodology, S.M., Z.X. and H.Y.; software, S.M. and Z.X.; validation, S.M. and Z.X.; investigation, S.M. and Z.X.; resources, S.M. and H.Y.; data curation, S.M. and Z.X.; writing—original draft preparation, S.M.; writing—review and editing, S.M., Z.X. and H.Y.; visualization, S.M.; supervision, Z.X. and H.Y.; project administration, H.Y.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data that support the findings of this study are available from the corresponding authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, H.; Jin, J.; Liu, Y.; Guo, Y.; Shen, Y. FSDF: A high-performance fire detection framework. Expert Syst. Appl. 2024, 238, 121665.
  2. Celik, T. Fast and efficient method for fire detection using image processing. ETRI J. 2010, 32, 881–890.
  3. Dunnings, A.J.; Breckon, T.P. Experimentally defined convolutional neural network architecture variants for non-temporal real-time fire detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1558–1562.
  4. de Venancio, P.V.A.; Lisboa, A.C.; Barbosa, A.V. An automatic fire detection system based on deep convolutional neural networks for low-power, resource-constrained devices. Neural Comput. Appl. 2022, 34, 15349–15368.
  5. Yan, S.; Zhang, J.; Barnes, N. Transmission-guided Bayesian generative model for smoke segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Online Conference, 22 February–1 March 2022; pp. 3009–3017.
  6. Muhammad, K.; Khan, S.; Elhoseny, M.; Ahmed, S.H.; Baik, S.W. Efficient fire detection for uncertain surveillance environment. IEEE Trans. Ind. Inform. 2019, 15, 3113–3122.
  7. Kaabi, R.; Sayadi, M.; Bouchouicha, M.; Fnaiech, F.; Moreau, E.; Ginoux, J.M. Early smoke detection of forest wildfire video using deep belief network. In Proceedings of the International Conference on Advanced Technologies for Signal and Image Processing, Sousse, Tunisia, 21–24 March 2018; pp. 1–6.
  8. Khan, S.; Muhammad, K.; Hussain, T.; Del Ser, J.; Cuzzolin, F.; Bhattacharyya, S.; Akhtar, Z.; de Albuquerque, V.H.C. DeepSmoke: Deep learning model for smoke detection and segmentation in outdoor environments. Expert Syst. Appl. 2021, 182, 115125.
  9. Yuan, F.; Zhang, L.; Xia, X.; Huang, Q.; Li, X. A gated recurrent network with dual classification assistance for smoke semantic segmentation. IEEE Trans. Image Process. 2021, 30, 4409–4422.
  10. Li, X.; Chen, Z.; Wu, Q.J.; Liu, C. 3D parallel fully convolutional networks for real-time video wildfire smoke detection. IEEE Trans. Circuits Syst. Video Technol. 2018, 30, 89–103.
  11. Yuan, F.; Zhang, L.; Xia, X.; Huang, Q.; Li, X. A wave-shaped deep neural network for smoke density estimation. IEEE Trans. Image Process. 2019, 29, 2301–2313.
  12. Zhang, Q.; Lin, G.; Zhang, Y.; Xu, G.; Wang, J. Wildland forest fire smoke detection based on faster R-CNN using synthetic smoke images. Procedia Eng. 2018, 211, 441–446.
  13. Sun, Y.; Feng, J. Fire and smoke precise detection method based on the attention mechanism and anchor-free mechanism. Complex Intell. Syst. 2023, 9, 5185–5198.
  14. Luo, Y.; Zhao, L.; Liu, P.; Huang, D. Fire smoke detection algorithm based on motion characteristic and convolutional neural networks. Multimed. Tools Appl. 2018, 77, 15075–15092.
  15. Li, J.; Zhang, J.; Tao, D. Deep image matting: A comprehensive survey. arXiv 2023, arXiv:2304.04672.
  16. Zolfi, A.; Kravchik, M.; Elovici, Y.; Shabtai, A. The translucent patch: A physical and universal attack on object detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Online Conference, 19–25 June 2021; pp. 15232–15241.
  17. Ma, S.; Ding, K.; Yan, H. SFMatting-800: A multi-scene smoke and fire image matting dataset for fine-grained fire detection. In Proceedings of the 4th International Conference on Artificial Intelligence and Computer Engineering, Dalian, China, 17–19 November 2023; pp. 22–30.
  18. Ke, Z.; Sun, J.; Li, K.; Yan, Q.; Lau, R.W.H. MODNet: Real-Time Trimap-Free Portrait Matting via Objective Decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, Online Conference, 22 February–1 March 2022; pp. 1140–1147.
  19. Cai, H.; Xue, F.; Xu, L.; Guo, L. TransMatting: Tri-token equipped transformer model for image matting. arXiv 2023, arXiv:2303.06476.
  20. Levin, A.; Lischinski, D.; Weiss, Y. A Closed-Form Solution to Natural Image Matting. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 228–242. [Google Scholar] [CrossRef]
  21. Chen, Q.; Li, D.; Tang, C. KNN Matting. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2175–2188. [Google Scholar] [CrossRef]
  22. Grady, L.; Schiwietz, T.; Aharon, S.; Westermann, R. Random walks for interactive alpha-matting. In Proceedings of the ICVIP, Benidorm, Spain, 7–9 September 2005; pp. 423–429. [Google Scholar]
  23. Sun, J.; Jia, J.; Tang, C.K.; Shum, H.Y. Poisson matting. ACM Trans. Graph. 2004, 23, 315–321. [Google Scholar] [CrossRef]
  24. He, K.; Rhemann, C.; Rother, C.; Tang, X.; Sun, J. A global sampling method for alpha matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 2049–2056. [Google Scholar]
  25. Wang, J.; Cohen, M.F. Optimized color sampling for robust matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
  26. Shahrian, E.; Rajan, D.; Price, B.; Cohen, S. Improving image matting using comprehensive sampling sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 636–643. [Google Scholar]
  27. Feng, X.; Liang, X.; Zhang, Z. A cluster sampling method for image matting via sparse coding. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 204–219. [Google Scholar]
  28. Xu, N.; Price, B.L.; Cohen, S.; Huang, T.S. Deep Image Matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 311–320. [Google Scholar]
  29. Yu, Q.; Zhang, J.; Zhang, H.; Wang, Y.; Lin, Z.; Xu, N.; Bai, Y.; Yuille, A. Mask guided matting via progressive refinement network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Online Conference, 19–25 June 2021; pp. 1154–1163. [Google Scholar]
  30. Sengupta, S.; Jayaram, V.; Curless, B.; Seitz, S.M.; Kemelmacher-Shlizerman, I. Background matting: The world is your green screen. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2291–2300. [Google Scholar]
  31. Yang, X.; Qiao, Y.; Chen, S.; He, S.; Yin, B.; Zhang, Q.; Wei, X.; Lau, R.W. Smart scribbles for image matting. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–21. [Google Scholar] [CrossRef]
  32. Ding, H.; Zhang, H.; Liu, C.; Jiang, X. Deep interactive image matting with feature propagation. IEEE Trans. Image Process. 2022, 31, 2421–2432. [Google Scholar] [CrossRef]
  33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online Conference, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  35. Lutz, S.; Amplianitis, K.; Smolic, A. Alphagan: Generative adversarial networks for natural image matting. arXiv 2018, arXiv:1807.10088. [Google Scholar]
  36. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4401–4410. [Google Scholar]
  37. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  38. Lu, H.; Dai, Y.; Shen, C.; Xu, S. Indices Matter: Learning to Index for Deep Image Matting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3265–3274. [Google Scholar]
  39. Li, Y.; Lu, H. Natural Image Matting via Guided Contextual Attention. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11450–11457. [Google Scholar]
  40. Tang, J.; Aksoy, Y.; Oztireli, C.; Gross, M.; Aydin, T.O. Learning-based sampling for natural image matting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3055–3063. [Google Scholar]
  41. Hou, Q.; Liu, F. Context-aware image matting for simultaneous foreground and alpha estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4130–4139. [Google Scholar]
  42. Cai, S.; Zhang, X.; Fan, H.; Huang, H.; Liu, J.; Liu, J.; Liu, J.; Wang, J.; Sun, J. Disentangled image matting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8819–8828. [Google Scholar]
  43. Shen, X.; Tao, X.; Gao, H.; Zhou, C.; Jia, J. Deep automatic portrait matting. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 92–107. [Google Scholar]
  44. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  45. Zhang, Y.; Gong, L.; Fan, L.; Ren, P.; Huang, Q.; Bao, H.; Xu, W. A late fusion cnn for digital matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7469–7478. [Google Scholar]
  46. Li, J.; Zhang, J.; Maybank, S.J.; Tao, D. Bridging Composite and Real: Towards End-to-End Deep Image Matting. Int. J. Comput. Vis. 2022, 130, 246–266. [Google Scholar] [CrossRef]
  47. Chen, G.; Liu, Y.; Wang, J.; Peng, J.; Hao, Y.; Chu, L.; Tang, S.; Wu, Z.; Chen, Z.; Yu, Z.; et al. PP-Matting: High-Accuracy Natural Image Matting. arXiv 2022, arXiv:2204.09433. [Google Scholar]
  48. Li, J.; Zhang, J.; Tao, D. Deep automatic natural image matting. In Proceedings of the International Joint Conference on Artificial Intelligence, Online Conference, 19–26 August 2021; pp. 800–806. [Google Scholar]
  49. Rhemann, C.; Rother, C.; Wang, J.; Gelautz, M.; Kohli, P.; Rott, P. A perceptually motivated online benchmark for image matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1826–1833. [Google Scholar]
  50. Qiao, Y.; Liu, Y.; Yang, X.; Zhou, D.; Xu, M.; Zhang, Q.; Wei, X. Attention-guided hierarchical structure aggregation for image matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 13676–13685. [Google Scholar]
  51. Liu, Y.; Xie, J.; Shi, X.; Qiao, Y.; Huang, Y.; Tang, Y.; Yang, X. Tripartite information mining and integration for image matting. In Proceedings of the International Conference on Computer Vision, Online Conference, 11–17 October 2021; pp. 7555–7564. [Google Scholar]
  52. Li, J.; Ma, S.; Zhang, J.; Tao, D. Privacy-preserving portrait matting. In Proceedings of the ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 3501–3509. [Google Scholar]
  53. Wu, S.; Zhang, X.; Liu, R.; Li, B. A dataset for fire and smoke object detection. Multimed. Tools Appl. 2023, 82, 6707–6726. [Google Scholar] [CrossRef]
  54. Chino, D.Y.; Avalhais, L.P.; Rodrigues, J.F.; Traina, A.J. Bowfire: Detection of fire in still images by integrating pixel color and texture analysis. In Proceedings of the Conference on Graphics, Patterns and Images, Salvador, Brazil, 26–29 June 2015; pp. 95–102. [Google Scholar]
  55. Zheng, Y.; Kambhamettu, C. Learning based digital matting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Kyoto, Japan, 20–25 June 2009; pp. 889–896. [Google Scholar]
  56. He, K.; Sun, J.; Tang, X. Fast matting using large kernel matting Laplacian matrices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 2165–2172. [Google Scholar]
  57. Berman, A.; Vlahos, P.; Dadourian, A. Comprehensive Method for Removing From an Image the Background Surrounding a Selected Subject. U.S. Patent 6,134,345, 17 October 2000. [Google Scholar]
  58. Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7482–7491. [Google Scholar]
  59. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  60. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  61. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
Figure 1. Image matting and alpha composition.
Figure 2. The framework of deep image matting [28].
Figure 3. Dataset generation flowchart.
Figure 4. Examples from the SFMatting-800 dataset.
Figure 5. The framework of SFMattingNet.
Figure 6. (a) Fixed receptive field in standard convolution; (b) adaptive receptive field in deformable convolution.
Figure 7. Two attention-based context information aggregation methods: (a) non-local block; (b) criss-cross attention block [61].
Figure 8. Visualization comparison of the evaluation results for traditional image matting methods on the SFMatting-800 test set.
Figure 9. Visualization comparison of the evaluation results for the trimap-based deep image matting methods on the SFMatting-800 test set.
Figure 10. Visualization comparison of the evaluation results for the trimap-free deep image matting methods on the SFMatting-800 test set.
Figure 11. Visualization comparison of the results under the effect of the spatial awareness module (errors highlighted in red boxes).
Table 1. Quantitative evaluation of the traditional and trimap-based image matting methods on the SFMatting-800 test set.

| Method Group | Method | SAD ↓ | MSE ↓ | Grad ↓ | Conn ↓ |
| --- | --- | --- | --- | --- | --- |
| Traditional Matting Methods | CF Matting [20] | 23.402 | 0.067 | 3.670 | 23.838 |
| | KNN Matting [21] | 19.108 | 0.035 | 4.620 | 19.587 |
| | LKM [56] | 21.009 | 0.042 | 3.583 | 21.186 |
| | RW Matting [22] | 35.599 | 0.111 | 6.290 | 37.226 |
| | LBDM [55] | 26.990 | 0.080 | 7.361 | 27.399 |
| Trimap-based Matting Methods | DIM [28] | 13.038 | 0.034 | 2.189 | 13.051 |
| | IndexNet [38] | 18.910 | 0.025 | 2.525 | 18.760 |
| | GCA Matting [39] | 18.571 | 0.024 | 2.170 | 18.426 |

↓ denotes lower-is-better for matting accuracy.
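For reference, the snippet below shows one common way the SAD and MSE errors can be computed from a predicted and a ground-truth alpha matte over a chosen evaluation region (e.g., the unknown band of the trimap, or the whole image). It assumes alpha values in [0, 1] and SAD reported in thousands, a widespread convention in matting benchmarks; the exact normalization behind Tables 1–3 is not restated here, and the Grad and Conn errors of Rhemann et al. [49] are omitted for brevity, so treat this as an illustrative approximation rather than the paper's evaluation code.

```python
import numpy as np


def sad(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Sum of absolute differences over the masked region, in thousands."""
    return float(np.abs(pred - gt)[mask].sum() / 1000.0)


def mse(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> float:
    """Mean squared error over the masked region."""
    diff = (pred - gt)[mask]
    return float(np.mean(diff ** 2))


# Toy usage: pred and gt are alpha mattes in [0, 1]; mask selects the pixels
# to evaluate (here: all pixels, corresponding to the "-ALL" columns).
pred = np.random.rand(256, 256)
gt = np.random.rand(256, 256)
mask = np.ones_like(gt, dtype=bool)
print(sad(pred, gt, mask), mse(pred, gt, mask))
```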
Table 2. Quantitative evaluation of the trimap-free image matting methods on the SFMatting-800 test set.

| Method | SAD ↓ | SAD-ALL ↓ | MSE ↓ | MSE-ALL ↓ | Grad ↓ | Grad-ALL ↓ | Conn ↓ | Conn-ALL ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MODNet [18] | 15.131 | 43.881 | 0.031 | 0.013 | **2.231** | 3.697 | 15.046 | 42.704 |
| GFM [46] | 17.746 | 50.658 | 0.049 | 0.025 | 2.558 | 5.744 | 17.966 | 50.224 |
| PP-Matting [47] | 17.256 | 48.853 | 0.058 | 0.026 | 2.699 | 5.141 | 17.111 | 48.470 |
| SFMattingNet (ours) | **13.161** | **33.553** | **0.029** | **0.011** | 2.293 | **3.485** | **12.734** | **32.262** |

↓ denotes lower-is-better for matting quality. The best performances are marked in bold.
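The 12.65% figure quoted in the Conclusions can be reproduced by averaging the per-metric relative error reductions of SFMattingNet with respect to MODNet over the eight error columns of Table 2 (all lower-is-better). The short script below is a hypothetical helper written for this article, not code from the paper, and simply performs that arithmetic.

```python
# Table 2 error values for MODNet and SFMattingNet, in column order:
# SAD, SAD-ALL, MSE, MSE-ALL, Grad, Grad-ALL, Conn, Conn-ALL
modnet = [15.131, 43.881, 0.031, 0.013, 2.231, 3.697, 15.046, 42.704]
ours = [13.161, 33.553, 0.029, 0.011, 2.293, 3.485, 12.734, 32.262]

# Relative reduction per metric; a negative value (Grad) means SFMattingNet
# is slightly worse on that metric.
reductions = [(m - o) / m for m, o in zip(modnet, ours)]
print(f"average relative error reduction: {100 * sum(reductions) / len(reductions):.2f}%")
# -> average relative error reduction: 12.65%
```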
Table 3. The ablation experiment results of each module in the SFMattingNet approach on the SFMatting-800 test set.

| Module1 | Module2 | SAD | SAD-ALL | MSE | MSE-ALL | Grad | Grad-ALL | Conn | Conn-ALL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| × | × | 15.131 | 43.881 | 0.031 | 0.013 | **2.231** | 3.697 | 15.046 | 42.704 |
| ✓ | × | 15.854 | 36.627 | 0.030 | **0.011** | 2.263 | 3.513 | 15.782 | 35.707 |
| × | ✓ | 13.978 | 37.501 | **0.028** | 0.012 | 2.244 | **3.411** | 13.775 | 36.524 |
| ✓ | ✓ | **13.161** | **33.553** | 0.029 | **0.011** | 2.293 | 3.485 | **12.734** | **32.262** |

Module1 is the spatial awareness module. Module2 is the non-rigid object feature extraction module. The best performances are marked in bold.
