Article

A Semi-Supervised Wildfire Image Segmentation Network with Multi-Scale Structural Fusion and Pixel-Level Contrastive Consistency

1 School of Computer and Artificial Intelligence, Nanjing University of Science and Technology ZiJin College, Nanjing 210023, China
2 College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China
3 School of Computer and Software, Nanjing University of Industry Technology, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Fire 2025, 8(8), 313; https://doi.org/10.3390/fire8080313
Submission received: 28 May 2025 / Revised: 9 July 2025 / Accepted: 5 August 2025 / Published: 7 August 2025

Abstract

The increasing frequency and intensity of wildfires pose serious threats to ecosystems, property, and human safety worldwide. Accurate semantic segmentation of wildfire images is essential for real-time fire monitoring, spread prediction, and disaster response. However, existing deep learning methods heavily rely on large volumes of pixel-level annotated data, which are difficult and costly to obtain in real-world wildfire scenarios due to complex environments and urgent time constraints. To address this challenge, we propose a semi-supervised wildfire image segmentation framework that enhances segmentation performance under limited annotation conditions by integrating multi-scale structural information fusion and pixel-level contrastive consistency learning. Specifically, a Lagrange Interpolation Module (LIM) is designed to construct structured interpolation representations between multi-scale feature maps during the decoding stage, enabling effective fusion of spatial details and semantic information, and improving the model’s ability to capture flame boundaries and complex textures. Meanwhile, a Pixel Contrast Consistency (PCC) mechanism is introduced to establish pixel-level semantic constraints between CutMix and Flip augmented views, guiding the model to learn consistent intra-class and discriminative inter-class feature representations, thereby reducing the reliance on large labeled datasets. Extensive experiments on two public wildfire image datasets, Flame and D-Fire, demonstrate that our method consistently outperforms other approaches under various annotation ratios. For example, with only half of the labeled data, our model achieves 5.0% and 6.4% mIoU improvements on the Flame and D-Fire datasets, respectively, compared to the baseline. This work provides technical support for efficient wildfire perception and response in practical applications.

1. Introduction

In recent years, the global wildfire situation has become increasingly severe due to the combined effects of global warming and extreme weather events. Wildfires cause devastating ecological damage, threaten economic resources, and endanger human safety [1,2]. Timely and accurate detection of wildfire-affected areas is critical for effective fire suppression, resource allocation, and ecological protection [3,4]. Among various image analysis tasks, semantic segmentation plays a particularly important role by providing pixel-level classification of wildfire regions, which enables the precise delineation of flames, smoke, and background. However, the development of accurate segmentation models is severely hindered by the scarcity of high-quality annotated wildfire images. In real-world scenarios, data collection is costly and time-sensitive, while manual annotation requires significant expertise and resources. To overcome these limitations, there is an urgent need for methods that can reduce dependence on large-scale labeled datasets while maintaining high segmentation accuracy in complex wildfire environments.
Currently, wildfire monitoring systems typically rely on a combination of sensor-based data acquisition and algorithmic data analysis. Sensors, such as infrared detectors, smoke sensors, temperature sensors, drones, and satellite platforms, provide critical real-time environmental data for fire detection. In the early stages, traditional wildfire monitoring methods often relied on manually designed rule-based algorithms or classical machine learning techniques to analyze sensor data. These approaches, however, usually required manual feature extraction and were easily affected by environmental factors such as weather conditions, lighting variations, and complex terrain, leading to reduced detection accuracy and stability. In recent years, deep learning-based methods have emerged as powerful alternatives for analyzing sensor-collected wildfire data. These methods leverage the representation learning capability of deep neural networks to automatically extract high-level semantic features from visual data, such as images captured by UAVs or surveillance cameras. Compared to traditional machine learning approaches, deep learning models demonstrate superior performance in handling complex, high-dimensional, and nonlinear wildfire scenes, leading to improved segmentation accuracy, robustness, and real-time detection efficiency. Nevertheless, the effectiveness of deep learning still largely depends on the availability of large-scale high-quality annotated data, which remains a significant bottleneck due to the cost and difficulty of obtaining labeled wildfire imagery.
In deep learning-based wildfire monitoring, two key aspects must be considered: the acquisition of reliable data and the development of effective data analysis methods. Adverse weather conditions, complex terrain, or limited sensor coverage can significantly affect the availability and quality of wildfire imagery, making fire detection and monitoring more challenging. However, when sensor systems such as UAVs, satellites, or ground-based cameras provide suitable data, deep learning techniques have demonstrated great potential to enhance the accuracy and robustness of wildfire image analysis. Among existing image analysis approaches, object detection and semantic segmentation are two commonly applied tasks. Object detection focuses on identifying the presence and location of fire regions, but it often struggles to accurately delineate fire boundaries or capture detailed scene information [5,6]. In comparison, semantic segmentation performs pixel-level classification, allowing for the precise separation of flames, smoke, and background [7,8]. This enables a finer understanding of wildfire scenes, which is essential for analyzing fire structure, assessing spread patterns, and supporting firefighting decisions, especially under complex environmental conditions such as heterogeneous vegetation or uneven terrain. Therefore, when reliable data are available, semantic segmentation has become an indispensable component of intelligent wildfire monitoring systems.
However, current semantic segmentation models for wildfire images still face two main challenges. First, the feature extraction is often performed at a single scale. Since wildfire images typically contain multi-scale fire regions and complex backgrounds, single-scale features are insufficient to capture the detailed semantics, resulting in reduced accuracy in edge localization and overall segmentation performance. Second, existing models rely heavily on labeled data. Deep learning-based segmentation requires large amounts of pixel-level annotated images for training, but high-quality wildfire image annotation is expensive and difficult to acquire, which limits the scalability and practical deployment of these models. To address these issues, researchers have proposed various methods to improve model performance. For example, Zheng et al. [9] introduced a Multi-scale Residual Group Attention (MRGA) mechanism to enhance the representation of small-scale targets and incorporated a Transformer structure to capture global context. Other researchers have incorporated prior knowledge into decoder design. The PMFD decoder [10] fuses multi-scale features guided by prior information to improve segmentation accuracy. The PAM module in MS-FRCNN [11] integrates both channel and spatial attention in parallel, which helps reduce background noise and increases the model’s sensitivity to fire regions.
To handle the limitations of single-scale feature extraction, many segmentation models enhance the decoder through one of the following strategies: (1) integrating feature pyramid networks to fuse multi-scale information [12,13]; (2) introducing self-attention mechanisms to strengthen feature representation and capture long-range dependencies [14,15]; and (3) designing new multi-scale fusion modules to improve edge detection and small object segmentation [16,17]. In addition, to reduce the dependence on annotated data, several methods based on unsupervised and weakly supervised learning have been proposed. Koottungal et al. [18] designed a semi-supervised wildfire segmentation model using a convolutional autoencoder to improve representation on unlabeled data. Wang et al. [19] proposed a weakly supervised framework that generates high-quality pseudo-labels using foreground-aware and context-aware pooling. Current alternative strategies include self-supervised semantic segmentation [20], semi-supervised segmentation frameworks [21,22], and pseudo-label generation based on weak supervision [23]. However, the application of these methods to wildfire image segmentation remains relatively limited.
To address these gaps, this paper proposes a semi-supervised semantic segmentation network for wildfire images that integrates multi-scale feature fusion and pixel-level contrastive consistency learning. The proposed model jointly enhances structural representation and supervision mechanisms to improve segmentation accuracy and model adaptability. The main contributions of this study are summarized as follows:
  • A Lagrange Interpolation Module (LIM) is designed to extract structured information from the same spatial position across feature maps at different scales. This enables effective multi-scale fusion during decoding, thereby improving the model’s ability to perceive fire boundaries, textures, and fine structural details.
  • A Pixel Contrast Consistency (PCC) mechanism is proposed to enforce pixel-level consistency between the labeled and unlabeled branches, allowing the model to maintain high segmentation accuracy in a semi-supervised setting and reduce the dependency on large-scale labeled datasets.
  • Experiments on the Flame and D-Fire datasets show that our method achieves up to 93.7% mIoU and consistently outperforms existing approaches under both full and limited supervision.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation, together with image classification and object detection, constitutes one of the three foundational problems in computer vision. Unlike image classification, which focuses solely on overall category labels, and object detection, which identifies the location of specific objects, semantic segmentation aims to assign a semantic label to every pixel in an image. This enables a more comprehensive and fine-grained understanding of image content. In recent years, with the rapid development of deep learning technologies, semantic segmentation has demonstrated significant research value and broad application prospects in various domains, including autonomous driving, medical image analysis, and remote sensing image processing [24].
In terms of model architecture, early Fully Convolutional Networks (FCNs) achieved end-to-end pixel-level prediction by replacing the fully connected layers in traditional Convolutional Neural Networks with convolutional layers. This allowed the model to output prediction results consistent with the input image size, laying the foundation for the field of semantic segmentation. Building upon this, the U-Net [25] architecture introduced a symmetric encoder–decoder structure and employed skip connections to fuse low-level spatial information with high-level semantic features, effectively enhancing segmentation accuracy and robustness. Initially applied in medical image analysis, U-Net has since evolved into multiple variants and found widespread application across various tasks. Additionally, the DeepLab [26] series of models incorporated atrous convolution to expand the receptive field and integrated pyramid pooling modules to enhance multi-scale information representation, further improving segmentation performance in complex scenes. Recently, with the success of Transformer architectures in natural language processing, their powerful global modeling capabilities have been introduced into the vision domain. Transformer-based semantic segmentation models leverage self-attention mechanisms to capture long-range dependencies, highlight key regions, and suppress redundant features, gradually exhibiting superior semantic understanding and segmentation performance.
Semantic segmentation technologies have been widely applied in various practical scenarios [24,27]. In autonomous driving, they are used to accurately identify roads, vehicles, pedestrians, and other critical objects, providing high-precision environmental perception information for autonomous systems. In medical image analysis, semantic segmentation is extensively utilized for lesion detection, organ boundary delineation, and other tasks, assisting clinicians in diagnosis and surgical planning. In remote sensing image processing, semantic segmentation facilitates land cover classification, land use analysis, and more, offering foundational data support for geographic information system development. In industrial manufacturing, it aids in product defect detection and quality control, enhancing production automation levels. Furthermore, in virtual and augmented reality scenarios, semantic segmentation supports object recognition and scene understanding, improving user interaction experiences. Despite numerous breakthroughs in recent years, semantic segmentation still faces several unresolved challenges, such as precise segmentation in complex scenes, high dependency on computational resources, substantial demand for annotated data, and limited generalization capabilities across domains. With the continuous evolution of deep learning technologies and computing platforms, researchers are progressively introducing emerging mechanisms like prior knowledge guidance, semi-supervised learning, self-supervised learning, and multi-modal information fusion into semantic segmentation tasks, providing new avenues and research directions to enhance model performance and practical applicability.

2.2. Semi-Supervised Semantic Segmentation

In semantic segmentation tasks, traditional fully supervised methods rely on large volumes of high-quality pixel-level annotated data for model training. However, acquiring such detailed annotations is not only labor-intensive and time-consuming but also poses significant challenges in specific application scenarios, severely limiting model scalability and practicality. Consequently, semi-supervised semantic segmentation techniques have emerged, aiming to leverage a small amount of labeled data alongside a large corpus of unlabeled data to collaboratively train models, thereby reducing annotation costs while enhancing segmentation performance and generalization capabilities [28].
Currently, semi-supervised semantic segmentation methods primarily focus on effectively utilizing unlabeled data [29]. Common strategies include introducing consistency regularization mechanisms, where random perturbations (such as rotation, scaling, noise, etc.) are applied to input images, and the model is encouraged to produce consistent outputs before and after perturbation, thereby extracting potential supervisory information from unlabeled data. Other approaches employ pseudo-labeling, where an initial model generates predicted labels for unlabeled samples, which are then used as auxiliary supervisory signals alongside original labeled data during training, effectively expanding the training dataset. Joint training strategies are also widely adopted, typically involving the co-optimization of two network branches that respectively utilize labeled and unlabeled data for learning, exchanging information to enhance representational capacity. In recent years, Generative Adversarial Networks (GANs) have been incorporated into semi-supervised semantic segmentation, generating realistic images or label distributions to bolster the model’s learning capabilities on unlabeled samples.
Despite the notable advantages of semi-supervised semantic segmentation in alleviating data annotation burdens, several challenges persist [30]. For instance, pseudo-labeling mechanisms may introduce substantial noise, affecting training stability and final performance. In scenarios with domain shifts, discrepancies between labeled and unlabeled data distributions can hinder model generalization. Additionally, some semi-supervised methods, due to structural complexity or prolonged training procedures, may not meet the real-time requirements of practical applications. Nevertheless, with ongoing advancements in deep learning, particularly in self-supervised mechanisms, multi-task collaborative frameworks, and cross-modal fusion strategies, semi-supervised semantic segmentation continues to hold significant potential for reducing annotation costs, improving segmentation efficiency, and enhancing model robustness.

2.3. Semantic Segmentation of Forest Fire Images

The research trajectory of wildfire image segmentation can be broadly categorized into three stages: traditional image processing methods, machine learning approaches, and deep learning techniques. In the early stages, researchers primarily relied on traditional image processing techniques, such as thresholding methods [31], edge detection [32], and region growing [33], to segment wildfire images. However, these methods exhibited notable limitations [34,35,36]: sensitivity to lighting variations and image noise, difficulty in handling complex textures or similar colors, lack of adaptability to dynamic changes, poor generalizability, high parameter dependency, cumbersome manual adjustments, and low efficiency in processing large-scale image data, making them inadequate for meeting the dual demands of accuracy and robustness in practical applications.
In parallel with the development of image processing techniques, large-scale wildfire monitoring has significantly benefited from advances in remote sensing technologies and global fire detection datasets [37]. For example, the MODIS (Moderate Resolution Imaging Spectroradiometer) dataset provides near-real-time global fire detection capabilities, supporting environmental management and disaster response. Similarly, the VIIRS (Visible Infrared Imaging Radiometer Suite) and the FIRMS (Fire Information for Resource Management System) platforms offer high-temporal-resolution wildfire detection data, widely used for operational fire tracking and large-scale ecological assessments. These datasets play an essential role in global wildfire monitoring. However, their relatively coarse spatial resolution makes it difficult to capture fine-grained details such as fire boundaries and small fire targets, particularly in complex forest environments. Therefore, high-resolution wildfire image segmentation remains an important complementary task for enhancing the accuracy and precision of wildfire perception at the local scale.
With the advancement of machine learning technologies, wildfire image segmentation gradually transitioned to a data-driven modeling phase. For example, Zheng et al. [38] utilized an improved Backpropagation (BP) neural network algorithm to enhance model recognition accuracy, while Thach et al. [39] combined Random Forests with Multilayer Perceptron neural networks to model tropical wildfire risks, optimizing variable selection through correlation analysis. Moayedi et al. [40] proposed a hybrid method based on various evolutionary algorithms (such as Genetic Algorithms, Particle Swarm Optimization, and Differential Evolution) to construct fire-sensitive area models, significantly improving the accuracy of fire threat assessments. Despite achieving certain successes [41], machine learning methods still face challenges [42,43], including high dependency on annotated data, substantial computational overhead during model training, sensitivity to outliers, limited generalization capabilities, high risk of overfitting, and poor model interpretability, all of which constrain their promotion and practicality in intelligent wildfire monitoring systems.
In recent years, the emergence of deep learning technologies has propelled wildfire image semantic segmentation into its third phase. Benefiting from continuous improvements in computational power and network architecture design, researchers have extensively applied Convolutional Neural Networks (CNNs) and their derivatives to wildfire image segmentation tasks. Existing studies [6,44,45,46] have demonstrated that deep learning models exhibit significant advantages in feature extraction and semantic representation, enabling fine-grained segmentation of complex wildfire regions. For instance, Yuan et al. [47] introduced an edge-aware module into the U-Net structure, employing a strategy of freezing large models and fine-tuning Adapter modules, combined with a hybrid training mechanism of self-supervised and fully supervised learning, to enhance the model’s ability to identify wildfire boundaries. Other studies [48] have incorporated pixel-level loss weighting mechanisms to improve segmentation performance for fire spots and smoke regions. Niu et al. [49] proposed the FFDSM model, integrating YOLOv5s-seg, Efficient Channel Attention (ECA) mechanisms, and Spatial Pyramid Pooling Fast Cross Stage Partial Connection (SPPFCSPC) structures, effectively enhancing adaptability to various wildfire target morphologies. Despite the superior performance of deep learning methods in wildfire image semantic segmentation, several limitations remain. Firstly, training high-precision models heavily depends on large volumes of pixel-level annotated data, which are challenging and time-sensitive to obtain in the context of wildfires, severely restricting model generalization. Secondly, training deep neural networks requires substantial computational resources, imposing high demands on hardware capabilities. Additionally, models are prone to overfitting, especially when data are scarce or model structures are overly complex, potentially compromising their effectiveness in real-world applications.
To address these challenges, semi-supervised learning has gradually been introduced into the field of wildfire image semantic segmentation, emerging as a crucial approach to mitigate data scarcity issues. Although research on applying semi-supervised learning to wildfire image segmentation is still in its early stages, existing studies have shown that this method, by combining a small amount of labeled data with a large corpus of unlabeled data for collaborative training, not only effectively reduces the burden of data annotation, but also fully exploits the latent semantic information in unlabeled data. While maintaining the high performance of deep learning models, it holds promise for improving the accuracy, efficiency, and adaptability of wildfire image segmentation under constraints of limited data and computational resources, providing theoretical and technical support for the development of intelligent and practical wildfire monitoring systems.

3. Materials and Methods

3.1. Dataset

This study conducts experimental analysis using two publicly available wildfire image datasets, namely the Flame dataset and the D-Fire dataset. Both are representative resources in the field of wildfire image segmentation and recognition. Figure 1 presents sample images from the two datasets, which visually demonstrate the image quality, fire types, and complexity of the scenes.
The Flame dataset is a comprehensive and high-quality dataset specifically designed for wildfire image segmentation and recognition tasks. It contains a variety of images captured by drones at different altitudes and angles, covering diverse fire intensities, complex terrain backgrounds, and varying weather conditions. The dataset features high-resolution images and complex scene structures. All images are annotated at the pixel level by a professional team, including labels for flame regions, smoke areas, and their corresponding segmentation masks. In addition, some images provide auxiliary information such as fire spread speed and burned area, which supports model training for segmentation and recognition in wildfire scenarios. The Flame dataset contains a total of 2003 annotated image samples, all of which include precise pixel-level ground truth, making it suitable for model training, validation, and testing.
The D-Fire dataset is a more extensive multi-scene fire image dataset designed for tasks such as fire detection, recognition, and segmentation. It includes various types of fire scenarios, such as forest fires and urban structure fires, and covers different stages of fire development, including early ignition, fire spreading, and full combustion. The dataset offers both image and video samples, with each image and video frame manually annotated to include semantic information such as flame regions, smoke distribution, and background classes. This dataset supports fine-grained supervision required for training segmentation models. In this study, a total of 9869 representative wildfire images were selected from the D-Fire dataset to ensure diversity and representativeness in terms of scene type, fire intensity, and terrain structure for model training.

3.2. Methodology

Figure 2 illustrates the architecture of the proposed semi-supervised semantic segmentation model for wildfire images. The model consists of two collaborative sub-network branches, which are used to process labeled and unlabeled images, respectively, enabling joint learning with both types of data. In the first branch, labeled and unlabeled images are passed through an encoder–decoder network to produce predictions denoted as $\hat{y}$ and $P_w$, respectively. In the second branch, the same unlabeled images are input into another encoder–decoder pathway to generate prediction $P_m$. The decoder is enhanced with a Lagrange Interpolation Module (LIM), which is applied at multiple stages to strengthen feature representation. The core idea of the LIM is to analyze multi-scale features from the same spatial positions in the encoder and to construct high-order Lagrange interpolation polynomials that capture cross-layer semantic dependencies and structured information. This mechanism enhances the model’s ability to identify flame boundaries and detailed regions. Furthermore, the two branches apply different data augmentation techniques (CutMix and Flip) to the same unlabeled image. The resulting augmented features, denoted as $F_{cutmix}$ and $F_{flip}$, are extracted through the encoder. A Pixel Contrast Consistency (PCC) mechanism is then used to constrain these features, encouraging semantically similar pixels to be close in feature space while keeping features from different classes well separated. This design enables the model to learn more discriminative semantic representations from unlabeled samples, improving generalization while maintaining segmentation accuracy.

3.2.1. Data Augmentation

To fully explore the semantic information in unlabeled images and enhance model robustness, two data augmentation strategies (Flip and CutMix) are employed, with each being applied to a different sub-network branch.
Flip augmentation is a classical image transformation technique that generates new training samples by horizontally or vertically flipping the input images. This strategy improves the model’s ability to adapt to viewpoint changes and increases data diversity, which is particularly useful for natural scenes with directional variance, such as wildfire imagery. Figure 3 shows the effect of Flip augmentation.
CutMix combines the principles of CutOut and MixUp to create new training samples by patch-wise mixing two different wildfire images. Specifically, a rectangular region is randomly selected from image A and replaced with the corresponding region from image B. The label mask is updated accordingly. This augmentation retains the semantic structure of each source image while introducing spatial variation, thereby improving the model’s ability to generalize to local fire patterns and enhancing its discrimination of regional fire features. An example of CutMix-generated training data is shown in Figure 4.
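As a concrete illustration, the following PyTorch sketch implements the two augmentations in the form described above. The flip probability and the rectangle-area range follow the settings reported later in Section 4.1; the function and tensor names are our own and are not taken from the authors' code.

```python
import random
import torch

def flip_augment(img: torch.Tensor) -> torch.Tensor:
    """Horizontally flip a (C, H, W) image tensor with probability 0.5."""
    if random.random() < 0.5:
        return torch.flip(img, dims=[-1])
    return img

def cutmix_augment(img_a, img_b, mask_a, mask_b):
    """Paste a random rectangle (25-50% of the image area) from image B and its mask into image A."""
    _, h, w = img_a.shape
    area_frac = random.uniform(0.25, 0.5)
    cut_h, cut_w = int(h * area_frac ** 0.5), int(w * area_frac ** 0.5)
    top = random.randint(0, h - cut_h)
    left = random.randint(0, w - cut_w)

    img_mix, mask_mix = img_a.clone(), mask_a.clone()
    img_mix[:, top:top + cut_h, left:left + cut_w] = img_b[:, top:top + cut_h, left:left + cut_w]
    mask_mix[top:top + cut_h, left:left + cut_w] = mask_b[top:top + cut_h, left:left + cut_w]
    return img_mix, mask_mix
```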

3.2.2. Encoding Stages

For the encoder, we employ the ResNet architecture as the backbone network, which has proven effective in image recognition tasks due to its deep feature representation capabilities. The encoder consists of five sequential feature extraction stages, which progressively abstract the input image into deeper semantic representations. In the first three stages, standard convolution and downsampling operations are used to capture both local and global context information. Since successive downsampling may lead to loss of spatial detail, atrous convolution is applied in the fourth and fifth stages to expand the receptive field while maintaining the spatial resolution of feature maps. This design enhances the model’s sensitivity to flame boundaries and detailed textures, minimizing spatial information loss caused by downsampling.
With this structure, the last three stages produce feature maps of the same spatial resolution, which facilitates subsequent feature alignment and fusion. Specifically, for an input image with resolution $H \times W$, the output feature map resolutions from the five stages are $\frac{H}{2} \times \frac{W}{2}$, $\frac{H}{4} \times \frac{W}{4}$, $\frac{H}{8} \times \frac{W}{8}$, $\frac{H}{8} \times \frac{W}{8}$, and $\frac{H}{8} \times \frac{W}{8}$, respectively. This design ensures better spatial alignment in the decoder, thus improving the overall performance and efficiency of the segmentation network.
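For reference, these stage-wise resolutions can be reproduced with torchvision's dilated-ResNet option, which keeps the fourth and fifth stages at stride 1 and uses atrous convolution instead. This is a stand-in sketch under that assumption, not the authors' exact backbone configuration.

```python
import torch
from torchvision.models import resnet50

# Dilated ResNet backbone: layer3 and layer4 keep stride 1 and use atrous
# convolution, so the last three stages all produce H/8 x W/8 feature maps.
backbone = resnet50(replace_stride_with_dilation=[False, True, True])

x = torch.randn(1, 3, 256, 256)
feats = []
x = backbone.conv1(x); x = backbone.bn1(x); x = backbone.relu(x)   # H/2
feats.append(x)
x = backbone.maxpool(x)
x = backbone.layer1(x); feats.append(x)                            # H/4
x = backbone.layer2(x); feats.append(x)                            # H/8
x = backbone.layer3(x); feats.append(x)                            # H/8 (dilated)
x = backbone.layer4(x); feats.append(x)                            # H/8 (dilated)
print([f.shape[-1] for f in feats])   # [128, 64, 32, 32, 32] for a 256x256 input
```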

3.2.3. Lagrange Interpolation Module

In semantic segmentation tasks, feature maps at different scales contain multi-level information of the image. Shallow features typically retain edge contours and texture details, while deeper features encode more abstract semantic information. Existing methods often fuse multi-scale features in the decoder stage through concatenation or element-wise addition. However, such simple fusion techniques do not explicitly model structural relationships between features, which limits the model’s ability to capture semantic structure, especially in challenging cases such as wildfire images, where flame boundaries are often blurred and texture patterns are complex. To address this, we introduce the LIM, which employs a high-order interpolation strategy to enhance multi-scale feature fusion, enabling structured semantic representation. The LIM utilizes Lagrange interpolation to construct polynomial functions over the feature space, modeling the continuous relationships among features from different layers. This enhances the model’s ability to perceive hierarchical structures and improves its sensitivity to weak signals and edge contours in wildfire segmentation.
Assume there are $n + 1$ distinct sampling points $X_0, X_1, \ldots, X_n$ with corresponding function values $f(X_0), f(X_1), \ldots, f(X_n)$. The objective of Lagrange interpolation is to construct an $n$-degree polynomial that satisfies the following.
$$L_n(X_i) = f(X_i), \quad i = 0, 1, \ldots, n$$
To achieve this, the $i$-th Lagrange basis function is defined as follows.
$$L_i(X) = \prod_{j=0,\, j \neq i}^{n} \frac{X - X_j}{X_i - X_j}$$
Then, the interpolating polynomial $L_n(X)$ is constructed as follows.
$$L_n(X) = \sum_{i=0}^{n} f(X_i) \cdot L_i(X)$$
One key advantage of Lagrange interpolation is that it does not require derivatives or differences, relying solely on function values. This makes it especially suitable for modeling discrete values at pixel locations in deep feature maps.
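A minimal numeric sketch of the equations above in plain Python (the sampling points and values are purely illustrative) shows that the interpolating polynomial reproduces every sampled value exactly while smoothly filling in between them.

```python
def lagrange_basis(i, x, xs):
    """L_i(x) = prod over j != i of (x - x_j) / (x_i - x_j)."""
    result = 1.0
    for j, xj in enumerate(xs):
        if j != i:
            result *= (x - xj) / (xs[i] - xj)
    return result

def lagrange_interpolate(x, xs, fs):
    """L_n(x) = sum_i f(x_i) * L_i(x); exact at every sampling point."""
    return sum(fi * lagrange_basis(i, x, xs) for i, fi in enumerate(fs))

xs = [0.0, 1.0, 2.0]   # sampling points X_0..X_n
fs = [1.0, 3.0, 2.0]   # function values f(X_0)..f(X_n)
assert abs(lagrange_interpolate(1.0, xs, fs) - 3.0) < 1e-9   # reproduces f(X_1)
print(lagrange_interpolate(1.5, xs, fs))                     # interpolated value: 2.875
```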
In implementation, the input to the LIM consists of the output feature map from the current decoder stage and a set of feature maps from different encoder stages, denoted as $\{f(X_k)\}_{k=1}^{K}$. At each spatial location $(h, w)$, these maps form a set of sampled values $f(X_0), f(X_1), \ldots, f(X_K)$, with each $f(X_k)$ representing the feature response at that location. The LIM constructs interpolation basis functions and computes enhanced features as follows.
$$f_{LIM}(X) = \sum_{k=0}^{K} f(X_k) \cdot L_k(X)$$
As illustrated in Figure 5, the LIM uses the current decoder output as the primary pathway and incorporates multi-scale features from encoder layers to assist the interpolation. These features, from different depths, contribute diverse contextual semantics, helping to build a more comprehensive and structurally consistent representation. Notably, the interpolated features produced by LIM not only integrate multi-scale information, but also preserve spatial continuity, which leads to clearer semantic boundaries and finer local details.
Compared with traditional linear fusion strategies, the interpolated features constructed by the LIM offer several advantages: (1) high-order feature modeling while preserving spatial consistency; (2) enhanced semantic structure representation without reducing resolution; and (3) improved sensitivity to fine-grained details such as edges and textures. These benefits make the LIM particularly effective in wildfire image segmentation scenarios characterized by blurred flame boundaries and strong illumination interference.
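To make the mechanism concrete, the following PyTorch sketch evaluates the interpolation formula over spatially aligned multi-scale feature maps. The paper does not specify how the interpolation nodes $X_k$ or the evaluation point $X$ are chosen, so this version assumes the scale index as the node and a learnable scalar query point; it illustrates the fusion rule rather than reproducing the authors' implementation.

```python
import torch
import torch.nn as nn

class LagrangeInterpolationModule(nn.Module):
    """Illustrative LIM: fuses K+1 spatially aligned feature maps by evaluating
    a Lagrange polynomial over the scale axis at every pixel.

    Assumptions (not stated in the paper): nodes X_k are the scale indices
    0..K, and the evaluation point X is a single learnable scalar.
    """
    def __init__(self, num_scales: int):
        super().__init__()
        self.register_buffer("nodes", torch.arange(num_scales, dtype=torch.float32))
        self.query = nn.Parameter(torch.tensor(num_scales / 2.0))  # evaluation point X

    def forward(self, feats):
        # feats: list of K+1 tensors, each (B, C, H, W) with identical shapes
        x, nodes = self.query, self.nodes
        fused = torch.zeros_like(feats[0])
        for k, f_k in enumerate(feats):
            # Lagrange basis L_k(X) = prod over j != k of (X - X_j) / (X_k - X_j)
            others = torch.cat([nodes[:k], nodes[k + 1:]])
            basis = torch.prod((x - others) / (nodes[k] - others))
            fused = fused + basis * f_k      # f_LIM(X) = sum_k f(X_k) * L_k(X)
        return fused

# usage: fuse the decoder feature with two encoder features of the same resolution
lim = LagrangeInterpolationModule(num_scales=3)
feats = [torch.randn(2, 64, 32, 32) for _ in range(3)]
print(lim(feats).shape)   # torch.Size([2, 64, 32, 32])
```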

3.2.4. Pixel Contrast Consistency

Most existing semi-supervised semantic segmentation methods adopt consistency regularization strategies, which compare predictions from different augmented views of the same unlabeled image and enforce prediction consistency through a loss function. These methods have shown promising performance in regulating the generation of pseudo-labels. However, they often overlook intra-class feature consistency, which may limit the model’s discriminative ability within semantic categories. To address this issue, we propose a PCC mechanism that performs pixel-level contrastive learning across different augmented views of the same image. This mechanism encourages the model to cluster semantically similar pixels and separate those from different classes in the feature space, thus enhancing discriminative learning and generalization on unlabeled data.
Given an unlabeled wildfire image $Image_u$, we apply two different data augmentation strategies, CutMix and Flip, to obtain two augmented versions $u_{cutmix}$ and $u_{flip}$.
$$u_{cutmix} = Aug_{cutmix}(Image_u)$$
$$u_{flip} = Aug_{flip}(Image_u)$$
These images are passed through the ResNet encoder to generate corresponding feature maps $F_{cutmix}$ and $F_{flip}$. Each encoding process yields five-stage multi-scale features, $F_{cutmix}^{1}, F_{cutmix}^{2}, \ldots, F_{cutmix}^{5}$ and $F_{flip}^{1}, F_{flip}^{2}, \ldots, F_{flip}^{5}$, which are used as inputs to the subsequent LIM for structural enhancement.
We then compute the cosine similarity between the two feature maps at the pixel level.
$$S_{cf} = \frac{1}{N} \sum_{i=1}^{N} \frac{d_{cutmix}^{i} \cdot d_{flip}^{i}}{\|d_{cutmix}^{i}\| \times \|d_{flip}^{i}\|}$$
Here, $N$ is the total number of pixels in the feature maps, and $d_{cutmix}^{i}$, $d_{flip}^{i}$ represent the feature vectors at the $i$-th pixel from the respective views. Based on this, the contrastive similarity loss between $F_{cutmix}$ and $F_{flip}$ is defined as follows:
$$L_{cf} = 1 - \frac{1}{B_u} \sum_{i=1}^{B_u} S_{cf}^{i}$$
where $B_u$ is the batch size for unlabeled images, and $S_{cf}^{i}$ denotes the average pixel-wise similarity of the $i$-th sample.
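A minimal PyTorch sketch of $S_{cf}$ and $L_{cf}$ is given below. It assumes the two views' feature maps have already been spatially aligned (for example, by flipping the Flip-branch features back), an alignment detail the paper does not spell out.

```python
import torch
import torch.nn.functional as F

def pixel_cosine_similarity(feat_cutmix: torch.Tensor, feat_flip: torch.Tensor) -> torch.Tensor:
    """S_cf per sample: mean pixel-wise cosine similarity between two (B, C, H, W) feature maps."""
    sim = F.cosine_similarity(feat_cutmix, feat_flip, dim=1)   # (B, H, W)
    return sim.flatten(1).mean(dim=1)                          # (B,), averaged over N = H*W pixels

def contrastive_similarity_loss(feat_cutmix, feat_flip):
    """L_cf = 1 - (1/B_u) * sum_i S_cf^i."""
    return 1.0 - pixel_cosine_similarity(feat_cutmix, feat_flip).mean()

f_cut, f_flip = torch.randn(4, 256, 32, 32), torch.randn(4, 256, 32, 32)
print(contrastive_similarity_loss(f_cut, f_flip))   # scalar loss, close to 1.0 for random features
```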
To further capture intra-class consistency and inter-class separation, we construct intra-class and inter-class feature sets. After decoding, $F_{cutmix}$ and $F_{flip}$ yield class probability distributions $p_c$ and $p_f$. In the Flip branch, the intra-class feature set for the $k$-th class is defined as follows:
$$M_{in}^{k} = \{\, r^{k} \mid \cos(r^{k}, p^{k}) \geq \alpha \,\}$$
where $\cos(\cdot)$ denotes cosine similarity, $r^{k}$ is the set of feature vectors for class $k$, $p^{k}$ is the class prototype, and $\alpha$ is a threshold for intra-class similarity. Similarly, the inter-class (outlier) feature set in the CutMix branch is as follows:
$$M_{dis}^{k} = \{\, h^{k} \mid \cos(h^{k}, p^{k}) \leq \varepsilon \,\}$$
where $h^{k}$ denotes outlier features for class $k$, and $\varepsilon$ is a lower similarity threshold. The contrastive distance loss between intra-class and outlier features is then computed as follows:
$$L_{dt} = \frac{1}{Z} \sum_{i=1}^{Z} \frac{1}{N_d} \mathrm{Near}_{cos}\!\left(M_{in}^{i}, M_{dis}^{i}\right)$$
where $Z$ is the number of semantic classes, $N_d$ is the number of outlier samples per class, and $\mathrm{Near}_{cos}(\cdot)$ denotes the cosine distance to the nearest intra-class prototype.
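The sketch below shows one possible realization of $M_{in}^{k}$, $M_{dis}^{k}$, and $L_{dt}$. The probability-weighted prototype construction and the nearest-neighbor search over intra-class features are our assumptions, since the paper does not specify these details.

```python
import torch
import torch.nn.functional as F

def pcc_distance_loss(flip_feats, cut_feats, flip_probs, cut_probs,
                      alpha=0.85, eps=0.35):
    """Illustrative L_dt: pulls CutMix-branch outlier pixels toward the nearest
    intra-class feature of the Flip branch. flip_feats/cut_feats: (N, C) pixel
    features; flip_probs/cut_probs: (N, Z) class probabilities.
    """
    num_classes = flip_probs.shape[1]
    losses = []
    for k in range(num_classes):
        # class prototype p^k from the Flip branch (probability-weighted mean feature)
        w = flip_probs[:, k:k + 1]
        proto = F.normalize((w * flip_feats).sum(0) / (w.sum() + 1e-6), dim=0)

        flip_sim = F.cosine_similarity(flip_feats, proto.unsqueeze(0), dim=1)
        cut_sim = F.cosine_similarity(cut_feats, proto.unsqueeze(0), dim=1)
        m_in = flip_feats[flip_sim >= alpha]    # intra-class set M_in^k
        m_dis = cut_feats[cut_sim <= eps]       # outlier set M_dis^k
        if len(m_in) == 0 or len(m_dis) == 0:
            continue
        # cosine distance from each outlier to its nearest intra-class feature
        sim = F.cosine_similarity(m_dis.unsqueeze(1), m_in.unsqueeze(0), dim=2)  # (Nd, Ni)
        losses.append((1.0 - sim.max(dim=1).values).mean())
    return torch.stack(losses).mean() if losses else flip_feats.new_zeros(())
```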
To further ensure semantic consistency between the two augmented branches, we introduce a distribution-level prediction alignment loss between $p_c$ and $p_f$:
$$L_c = \frac{1}{B_u} \sum_{i=1}^{B_u} d(p_c^{i}, p_f^{i})$$
where $d(\cdot)$ denotes a distance metric between distributions, such as KL divergence or mean squared error (MSE).
For the labeled data, a simple Flip augmentation is applied. The model prediction $\hat{y}$ is compared with the ground truth $y$ using a standard cross-entropy loss:
$$L_s = \frac{1}{B_L} \cdot \frac{1}{H \times W} \sum_{i=1}^{B_L} \sum_{j=1}^{H \times W} l_{ce}(\hat{y}_{ij}, y_{ij})$$
where $H \times W$ denotes the spatial resolution of the image, $j$ is the pixel index, $B_L$ is the batch size for labeled images, and $l_{ce}(\cdot)$ represents the cross-entropy function.
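Putting the pieces together, a hedged sketch of the overall training objective is shown below. The choice of MSE for $d(\cdot)$ and the single weighting factor lambda_u on the unsupervised terms are assumptions, as the paper does not report the loss weights.

```python
import torch
import torch.nn.functional as F

def total_loss(logits_labeled, labels,          # supervised branch
               p_cut, p_flip,                   # class probabilities from the two views
               l_cf, l_dt,                      # contrastive terms from the sketches above
               lambda_u: float = 1.0):          # assumed weighting, not given in the paper
    """Combine L_s (cross-entropy), L_c (prediction alignment, MSE here), and the
    pixel-contrast terms L_cf and L_dt into one training objective."""
    l_s = F.cross_entropy(logits_labeled, labels)   # L_s over labeled pixels
    l_c = F.mse_loss(p_cut, p_flip)                 # L_c with d(.) chosen as MSE
    return l_s + lambda_u * (l_c + l_cf + l_dt)

# shapes: logits (B_L, 2, H, W), labels (B_L, H, W) with {0: background, 1: fire}
logits = torch.randn(2, 2, 64, 64)
labels = torch.randint(0, 2, (2, 64, 64))
p_cut = torch.softmax(torch.randn(2, 2, 64, 64), 1)
p_flip = torch.softmax(torch.randn(2, 2, 64, 64), 1)
print(total_loss(logits, labels, p_cut, p_flip, torch.tensor(0.9), torch.tensor(0.4)))
```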

4. Experiments and Results

4.1. Experiment Setup

All experiments were conducted on a high-performance computing platform equipped with an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) and running the Ubuntu 18.04 operating system. The software environment was based on Python 3.7 and the PyTorch 1.7 deep learning framework. The Flame and D-Fire datasets used in this study consist of RGB images with three channels, originally captured by UAVs or ground-based cameras in diverse wildfire scenarios. To ensure consistency, all images were resized to a resolution of 256 × 256 pixels using bilinear interpolation. The images were stored in standard JPG format, and all pixel values were normalized to the range [0, 1]. During model training, random data augmentation strategies were applied to both labeled and unlabeled samples to improve robustness and generalization. For labeled data, random horizontal flipping (probability of 0.5), random cropping to the final size of 256 × 256 pixels, and color jittering (with brightness, contrast, and saturation randomly adjusted within ±20% of the original values) were employed. For unlabeled data, two independent augmentations were applied to each image on the fly within the training pipeline. Specifically, Flip augmentation involved a random horizontal flip with a probability of 0.5, while CutMix augmentation followed the standard CutMix procedure, where a randomly selected rectangular region (occupying 25% to 50% of the image area) from one unlabeled image was replaced with the corresponding region from another unlabeled image, and the corresponding label masks were updated accordingly.
SGD was used to update model parameters, with an initial learning rate of 0.001. The momentum factor was set to 0.9, and the weight decay coefficient was set to $5 \times 10^{-4}$ to prevent overfitting. The learning rate decayed exponentially every 10 epochs, with a decay rate of 0.95. Model training was performed for a total of 100 epochs. The batch size was set to 8 for labeled data and 16 for unlabeled data to maintain a balance between training stability and memory usage. In the PCC module, the intra-class aggregation threshold $\alpha$ and the inter-class outlier threshold $\varepsilon$ were set to 0.85 and 0.35, respectively, to balance discriminative capability and semantic cohesion.
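These hyperparameters translate directly into a standard PyTorch optimizer and scheduler configuration, sketched below with a placeholder model in place of the full segmentation network.

```python
import torch

model = torch.nn.Conv2d(3, 2, 3, padding=1)   # placeholder for the segmentation network

# SGD with the hyperparameters reported in this section
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)

# exponential decay by a factor of 0.95 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.95)

for epoch in range(100):
    # ... one epoch over batches of 8 labeled + 16 unlabeled images ...
    optimizer.step()        # placeholder for the actual training loop
    scheduler.step()
```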
As the primary evaluation metric, the mean Intersection over Union (mIoU) was adopted, which is widely used in semantic segmentation tasks. The mIoU is defined as follows:
$$mIoU = \frac{1}{N} \sum_{i=1}^{N} IoU_i$$
where $N$ is the total number of semantic classes, and $IoU_i$ represents the Intersection over Union for the $i$-th class. In the wildfire image segmentation task, the number of classes is two, namely the background class and the fire spot class. A higher mIoU value indicates that the predicted segmentation results are more consistent with the ground truth, making it an effective metric to assess a model’s performance across different semantic regions.
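For completeness, a minimal implementation of the two-class mIoU metric (class indices and tensor shapes are illustrative):

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int = 2) -> float:
    """mIoU = (1/N) * sum_i IoU_i over the background and fire classes."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        inter = (pred_c & target_c).sum().item()
        union = (pred_c | target_c).sum().item()
        if union > 0:                       # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = torch.randint(0, 2, (256, 256))
gt = torch.randint(0, 2, (256, 256))
print(mean_iou(pred, gt))   # roughly 0.33 for random binary maps
```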

4.2. Comparison with Other Methods

To evaluate the effectiveness of the proposed semi-supervised semantic segmentation model in wildfire image scenarios, we conducted comparative experiments against several state-of-the-art methods. These include pseudo-labeling-based approaches (e.g., PseudoSeg [50], UniMatch [30]), consistency regularization methods (e.g., ECS [51], DCC [28], ELN [29], ESL [22]), knowledge-distillation-based models (e.g., MKD [52]), cross pseudo supervision (e.g., CPCL [53]), and attention-guided strategies (e.g., SemiCVT [54], S4Former [55], Allspark [56]). For a fair and objective comparison, all models were trained and tested under the same experimental environment. We evaluated the methods on two representative wildfire segmentation datasets (Flame and D-Fire) under four different labeling ratios: 1/8, 1/4, 1/2, and full supervision.
As shown in Table 1, the performance of all models improves with the increase in labeled data, demonstrating the critical role of supervision even in small quantities. MKD, for instance, achieves 88.1% mIoU in the fully supervised setting, but only 70.8% in the 1/8 setting. This method leverages knowledge distillation to transfer knowledge from labeled to unlabeled data but suffers from high complexity and limited spatial feature awareness. CPCL performs relatively poorly across all labeling ratios, especially in the low-label regime, as it relies heavily on the quality of pseudo-labels and is more sensitive to the variation in wildfire image conditions, such as illumination, occlusion, and scale variance. In contrast, our model consistently achieves top or second-best performance across all settings. In particular, with a Transformer backbone and full supervision, it reaches the highest mIoU of 91.6%. This superior performance is attributed to the proposed LIM, which effectively fuses multi-scale features at corresponding spatial locations in the decoder, enhancing the recognition of flame boundaries and smoke gradients. Additionally, the PCC module introduces contrastive constraints in the feature space, improving discrimination on unlabeled data and refining pseudo-label quality.
Figure 6 shows visual comparisons of segmentation results from different models on the Flame dataset using 1/8 labeled data and the ResNet backbone. Due to small-scale and partially occluded fire targets, ECS and DCC tend to miss flames, while UniMatch captures targets but struggles with boundary precision. Our model, however, accurately segments small fire regions and preserves fine-grained details.
Figure 7 illustrates segmentation visualizations on the D-Fire dataset, demonstrating our model’s superior ability to distinguish flame boundaries and regions with high visual similarity between fire and background.
As shown in Table 2, model performance consistently improves with more labeled data. MKD and DCC perform well under high-label settings, with MKD reaching 91.9% mIoU under full supervision. However, MKD struggles in low-label scenarios (73.6% at 1/8). CPCL again shows unstable results due to its reliance on pseudo-label quality. In contrast, our model outperforms all other ResNet-based methods across all label ratios. Notably, it achieves 85.8% and 92.8% mIoU at 1/2 and full supervision, respectively. With the Transformer backbone, our method surpasses the previous best-performing models, SemiCVT and S4Former, reaching 93.7% mIoU. The results highlight the effectiveness of the LIM in fusing multi-scale structure and the PCC in enforcing semantic consistency on unlabeled data.

4.3. Ablation Experiments

To validate the contributions of the proposed key modules to model performance enhancement, we designed systematic ablation experiments to assess the individual impacts of the Lagrange Interpolation Module (LIM) and the Pixel Contrast Consistency (PCC) mechanism under varying levels of supervision. Experiments were conducted on the Flame and D-Fire wildfire semantic segmentation datasets, employing ResNet as the unified backbone for feature extraction. The proportions of annotated data were set to 1/8, 1/4, and 1/2, simulating typical scenarios ranging from weak to moderate supervision. Each experimental group was trained under identical settings, with the LIM and PCC module incrementally activated to compare performance variations across different configurations. The results are presented in Table 3.
As observed in Table 3, the incremental integration of functional modules leads to a consistent increase in mIoU values on both the Flame and D-Fire datasets, indicating that both the LIM and PCC module significantly enhance segmentation accuracy. Specifically, the LIM introduces a Lagrangian interpolation strategy during the decoding phase, enabling the structural fusion of multi-scale feature maps at corresponding spatial locations. This enhances the model’s ability to recognize complex regions such as flame edges and texture details. Even without the PCC module, activating the LIM alone yields noticeable gains. For instance, under the 1/2 annotation proportion, the mIoU on the Flame dataset improves from 77.6% to 79.9% and on D-Fire from 79.4% to 83.5%. Under the sparse 1/8 setting, Flame sees a 1.8% increase and D-Fire a 2.9% increase, demonstrating the LIM’s robust structural modeling capabilities under limited annotations. In contrast, the PCC module enhances model performance from a different perspective. It imposes pixel-level consistency constraints on unlabeled images under CutMix and Flip data augmentation views, guiding the model to learn intra-class compact and inter-class separable semantic representations. Without the LIM, introducing PCC alone results in even greater improvements, particularly under the 1/8 and 1/4 settings. For example, the Flame dataset’s mIoU increases from 70.4% to 73.5% and D-Fire from 72.8% to 76.1%, indicating that PCC significantly improves pseudo-label quality and feature distribution discriminability under low annotation ratios. When both modules are activated simultaneously, the model achieves optimal performance. Under the 1/2 setting, Flame and D-Fire reach mIoU values of 82.6% and 85.8%, respectively, representing improvements of 5.0% and 6.4% over the baseline. Notably, the combined effect of the two modules surpasses that of any single module across all annotation proportions, suggesting that the LIM and PCC complement each other in structural feature extraction and semantic consistency modeling, collectively enhancing the model’s capabilities in edge clarity, object completeness, and regional discriminability.
To visually observe the effects of each module, Figure 8 presents class activation maps after integrating different modules. In these maps, deeper red indicates key feature regions contributing most to the prediction, while blue indicates weaker contributions. The color intensity reflects the response strength, with deeper colors representing higher response values. After adding the LIM, flame contour features become more prominent, indicating the LIM’s effectiveness in enhancing edge feature extraction. However, in cases with small flame points and significant occlusion, the LIM’s enhancement is limited. Integrating PCC, with its strong pixel-level contrastive capabilities, effectively reduces the background’s influence on flame feature extraction, leading to more accurate feature extraction, especially in scenarios with small and heavily occluded flame points.
To further quantify the impact of different structural enhancement modules on model performance, we compared three commonly used modules in semantic segmentation tasks: the Squeeze-and-Excitation (SE) block, the Convolutional Block Attention Module (CBAM), and the Efficient Channel Attention (ECA) block. Additionally, we integrated our proposed LIM into the baseline model under identical training configurations using a 1/2 annotation data proportion. The results are shown in Table 4.
As shown in Table 4, the baseline model achieves mIoU values of 77.6% on the Flame dataset and 79.4% on D-Fire. Introducing CBAM, SE-Block, ECA-Block, and LIM enhances model performance, with the LIM yielding the most significant improvements by increasing mIoU to 79.9% on Flame (a 2.3% increase) and to 83.5% on D-Fire (a 4.1% increase). This further corroborates the LIM’s advantages in fusing multi-scale structural information and reinforcing spatial consistency modeling, particularly excelling in handling wildfire images with blurred boundaries or sparse textures. In comparison, while the CBAM offers relatively smaller improvements (1.8% on Flame and 4.0% on D-Fire), it still provides certain boundary refinement capabilities on the D-Fire dataset. The SE-Block and ECA-Block, with lower parameter counts, offer moderate performance gains. Notably, the ECA-Block, as a lightweight solution, balances performance and computational efficiency, making it practically valuable in resource-constrained scenarios. In model design, module selection can be flexibly adjusted based on task complexity, resource constraints, and performance requirements. If segmentation accuracy is the primary objective, the LIM is recommended. For a balance between efficiency and performance, the ECA-Block and SE-Block serve as lightweight alternatives.

4.4. Computational Complexity and Scalability Analysis

To assess the computational efficiency and scalability of the proposed method, we compare its parameter count and training time with representative semi-supervised segmentation models on the Flame dataset under the 1/2 labeling ratio. All experiments are conducted using an input resolution of 256 × 256 and a consistent hardware platform equipped with an NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA). The segmentation performance is measured using the mIoU, and the training time is reported as the average processing time per epoch.
As summarized in Table 5, our method achieves superior segmentation accuracy while maintaining competitive model complexity. Compared to other ResNet-based approaches, our method achieves the highest mIoU (82.6%) with only 37.9 M parameters, which is fewer than several other methods such as DCC (40.3 M), ELN (41.9 M), and UniMatch (45.5 M). The average training time per epoch is only 338 s, comparable to or lower than many baselines, indicating the efficiency of the proposed design. When combined with a Transformer backbone, our method further improves segmentation accuracy to 84.5% mIoU, outperforming strong Transformer-based baselines such as SemiCVT (82.4%), S4Former (83.6%), and Allspark (82.7%), while maintaining a similar parameter count (53.7 M) and training time (485 s per epoch). These results demonstrate that the proposed framework is both effective and scalable, offering an attractive trade-off between accuracy and computational cost.
We further evaluate the scalability of the proposed framework under different input resolutions. Table 6 presents mIoU results for input sizes of 256 × 256 , 384 × 384 , and 512 × 512 across various labeling ratios on the Flame dataset. As shown in Table 6, the model maintains stable and competitive segmentation performance across all input sizes and supervision levels. Even with limited labeled data (e.g., 1/8 labeling ratio), the model benefits from higher resolution inputs, confirming its scalability and robustness to different image scales, which are common in real-world wildfire monitoring scenarios.

5. Discussion

This study aimed to address the prevalent dependency on large-scale annotated data in wildfire image semantic segmentation by proposing a semi-supervised segmentation framework that integrates multi-scale structural feature modeling and pixel-wise contrastive consistency learning. Through systematic experiments on the Flame and D-Fire datasets, our approach consistently outperformed mainstream semi-supervised segmentation methods across varying annotation ratios. Notably, on the D-Fire dataset, the model achieved a state-of-the-art mIoU of 93.7% when employing a Transformer backbone, demonstrating the framework’s effectiveness in improving segmentation accuracy and generalization.
Compared to existing approaches such as pseudo-label-based methods (e.g., PseudoSeg), consistency regularization (e.g., ECS, ELN), and knowledge distillation frameworks (e.g., MKD), the proposed method offers distinct advantages in structural representation and supervision. Specifically, the Lagrange Interpolation Module (LIM) enables spatially consistent multi-scale feature modeling during decoding, overcoming the limitations of conventional fusion operations. The Pixel Contrast Consistency (PCC) mechanism enforces semantic-level feature alignment across different augmented views of unlabeled data, which enhances the discriminative power of learned features and improves pseudo-label quality. Together, these modules contribute to more accurate boundary localization and better feature separation across semantic classes.
In terms of computational efficiency, the proposed LIM and PCC module are lightweight by design and introduce only marginal overhead compared to baseline methods. Experiments confirm that our framework maintains high segmentation accuracy and stable resource consumption even when processing higher-resolution inputs (e.g., 384 × 384 , 512 × 512 ) or operating under varying levels of supervision. These characteristics demonstrate the framework’s strong scalability and make it suitable for deployment in large-scale real-time wildfire monitoring systems.
Despite the strong empirical results, several limitations remain. First, while Flame and D-Fire are high-quality datasets, they still exhibit limitations in scale, geographic diversity, and scene complexity, which may restrict the generalizability of the model to global wildfire environments. Moreover, real-world wildfire monitoring involves numerous extreme environmental conditions not fully captured in public datasets, such as dense smoke, intense sunlight, specular reflections from water bodies, fog, or rain. Although some of these are partially represented in our evaluation, further studies are needed to rigorously assess model performance under such challenges. Future research will, therefore, explore robustness-oriented improvements through advanced data augmentation, domain adaptation techniques, and the targeted collection of field data under adverse weather conditions. Finally, while this work focuses on static image segmentation, we recognize that dynamic analysis of wildfire video data is essential for practical fire monitoring, particularly for modeling fire evolution and spread. In future work, we plan to extend the framework to video-based segmentation by incorporating temporal modeling techniques and curating dedicated wildfire video datasets to enhance the system’s real-world applicability.

6. Conclusions

This study addresses the challenges in forest fire image semantic segmentation, particularly the high cost of data annotation, insufficient structural modeling capability, and limited pseudo-label quality. We proposed a semi-supervised segmentation framework that combines multi-scale structural information with pixel-wise contrastive learning, aiming to achieve accurate and efficient segmentation of wildfire regions under limited annotation conditions. The framework integrates two key modules: the Lagrange Interpolation Module, which enhances the model’s ability to capture edge and texture details by constructing structured multi-scale feature representations during decoding, and the Pixel Contrast Consistency mechanism, which enforces pixel-level contrastive constraints between augmented views to extract more discriminative features and improve the utilization of unlabeled data. Experimental results on two public wildfire datasets demonstrate that our method consistently outperforms existing state-of-the-art approaches across different levels of supervision, validating its effectiveness and generalizability. This work offers a novel approach for disaster image segmentation under low-supervision settings and shows strong potential for real-world applications in intelligent wildfire monitoring. Future research will focus on enhancing cross-domain generalization, expanding weakly supervised learning strategies, and optimizing lightweight model deployment to further improve the practical utility and scalability of the proposed framework in complex environments.

Author Contributions

Conceptualization, Y.S.; methodology, Y.S.; software, Y.S.; validation, W.W.; formal analysis, W.W.; investigation, J.G.; resources, Y.X.; data curation, H.L.; writing—original draft preparation, Y.S.; writing—review and editing, Y.X.; visualization, Y.S.; supervision, H.L.; project administration, Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

Jiangsu Provincial University Philosophy and Social Science General Project (Grant No. 2023SJYB0676); Key Scientific Research Project of Zijin College, Nanjing University of Science and Technology (Grant No. 2023ZRKX0401003); Start-up Fund for New Talented Researchers of Nanjing University of Industry Technology (Grant No. YK22-05-01).

Data Availability Statement

The Flame dataset is available at https://ieee-dataport.org/open-access/flame-dataset-aerial-imagery-pile-burn-detection-using-drones-uavs (accessed on 2 January 2025). The D-Fire dataset is available at https://github.com/gaiasd/DFireDataset (accessed on 20 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, G.; Li, H.; Xiao, Q.; Yu, P.; Ding, Z.; Wang, Z.; Xie, S. Fighting against Forest Fire: A Lightweight Real-Time Detection Approach for Forest Fire Based on Synthetic Images. Expert Syst. Appl. 2025, 262, 125620. [Google Scholar] [CrossRef]
  2. Wang, Y.; Wang, Y.; Khan, Z.A.; Huang, A.; Sang, J. Multi-Level Feature Fusion Networks for Smoke Recognition in Remote Sensing Imagery. Neural Netw. 2025, 184, 107112. [Google Scholar] [CrossRef]
  3. Dampage, U.; Bandaranayake, L.; Wanasinghe, R.; Kottahachchi, K.; Jayasanka, B. Forest Fire Detection System Using Wireless Sensor Networks and Machine Learning. Sci. Rep. 2022, 12, 46. [Google Scholar] [CrossRef]
  4. Pincott, J.; Tien, P.W.; Wei, S.; Calautit, J.K. Indoor Fire Detection Utilizing Computer Vision-Based Strategies. J. Build. Eng. 2022, 61, 105154. [Google Scholar] [CrossRef]
  5. Sousa, M.J.; Moutinho, A.; Almeida, M. Wildfire Detection Using Transfer Learning on Augmented Datasets. Expert Syst. Appl. 2020, 142, 112975. [Google Scholar] [CrossRef]
  6. Zhang, L.; Wang, M.; Ding, Y.; Wan, T.; Qi, B.; Pang, Y. FBC-ANet: A Semantic Segmentation Model for UAV Forest Fire Images Combining Boundary Enhancement and Context Awareness. Drones 2023, 7, 456. [Google Scholar] [CrossRef]
  7. Singh, H.; Ang, L.-M.; Lewis, T.; Paudyal, D.; Acuna, M.; Srivastava, P.K.; Srivastava, S.K. Trending and Emerging Prospects of Physics-Based and ML-Based Wildfire Spread Models: A Comprehensive Review. J. Res. 2024, 35, 135. [Google Scholar] [CrossRef]
  8. Ghali, R.; Akhloufi, M.A.; Mseddi, W.S. Deep Learning and Transformer Approaches for UAV-Based Wildfire Detection and Segmentation. Sensors 2022, 22, 1977. [Google Scholar] [CrossRef] [PubMed]
  9. Zheng, Y.; Wang, Z.; Xu, B.; Niu, Y. Multi-Scale Semantic Segmentation for Fire Smoke Image Based on Global Information and U-Net. Electronics 2022, 11, 2718. [Google Scholar] [CrossRef]
  10. Li, K.; Yuan, F.; Wang, C. An Effective Multi-Scale Interactive Fusion Network with Hybrid Transformer and CNN for Smoke Image Segmentation. Pattern Recognit. 2025, 159, 111177. [Google Scholar] [CrossRef]
  11. Zhang, L.; Wang, M.; Ding, Y.; Bu, X. MS-FRCNN: A Multi-Scale Faster RCNN Model for Small Target Forest Fire Detection. Forests 2023, 14, 616. [Google Scholar] [CrossRef]
  12. Van Quyen, T.; Kim, M.Y. Feature Pyramid Network with Multi-Scale Prediction Fusion for Real-Time Semantic Segmentation. Neurocomputing 2023, 519, 104–113. [Google Scholar] [CrossRef]
  13. Hong, Y.; Pan, H.; Sun, W.; Wang, Y.; Bian, N.; Gao, H. Dual Spatial-Temporal Feature Pyramid With Decoupled Temporal Mining for Video Semantic Segmentation. IEEE Trans. Intell. Veh. 2024, 1–12. [Google Scholar] [CrossRef]
  14. Ma, Y.; Lan, X. Semantic Segmentation Using Cross-Stage Feature Reweighting and Efficient Self-Attention. Image Vis. Comput. 2024, 145, 104996. [Google Scholar] [CrossRef]
  15. Liu, Y.; Zhou, Q.; Wang, J.; Wang, Z.; Wang, F.; Wang, J.; Zhang, W. Dynamic Token-Pass Transformers for Semantic Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1827–1836. [Google Scholar]
  16. Liu, J.; Zhang, F.; Zhou, Z.; Wang, J. BFMNet: Bilateral Feature Fusion Network with Multi-Scale Context Aggregation for Real-Time Semantic Segmentation. Neurocomputing 2023, 521, 27–40. [Google Scholar] [CrossRef]
  17. Wang, L.; Zhang, C.; Li, R.; Duan, C.; Meng, X.; Atkinson, P.M. Scale-Aware Neural Network for Semantic Segmentation of Multi-Resolution Remote Sensing Images. Remote Sens. 2021, 13, 5015. [Google Scholar] [CrossRef]
  18. Koottungal, A.; Pandey, S.; Nambiar, A. Semi-Supervised Classification and Segmentation of Forest Fire Using Autoencoders. In Proceedings of the International Conference on Advanced Concepts for Intelligent Vision Systems, Kumamoto, Japan, 21–23 August 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 27–39. [Google Scholar]
  19. Wang, J.; Wang, Y.; Liu, L.; Yin, H.; Ye, N.; Xu, C. Weakly Supervised Forest Fire Segmentation in UAV Imagery Based on Foreground-Aware Pooling and Context-Aware Loss. Remote Sens. 2023, 15, 3606. [Google Scholar] [CrossRef]
  20. Yang, Z.; Yu, H.; He, Y.; Sun, W.; Mao, Z.-H.; Mian, A. Fully Convolutional Network-Based Self-Supervised Learning for Semantic Segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 132–142. [Google Scholar] [CrossRef]
  21. Li, L.; Zhang, W.; Zhang, X.; Emam, M.; Jing, W. Semi-Supervised Remote Sensing Image Semantic Segmentation Method Based on Deep Learning. Electronics 2023, 12, 348. [Google Scholar] [CrossRef]
  22. Ma, J.; Wang, C.; Liu, Y.; Lin, L.; Li, G. Enhanced Soft Label for Semi-Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1185–1195. [Google Scholar]
  23. Pan, J.; Ou, X.; Xu, L. A Collaborative Region Detection and Grading Framework for Forest Fire Smoke Using Weakly Supervised Fine Segmentation and Lightweight Faster-RCNN. Forests 2021, 12, 768. [Google Scholar] [CrossRef]
  24. Cheng, J.; Deng, C.; Su, Y.; An, Z.; Wang, Q. Methods and Datasets on Semantic Segmentation for Unmanned Aerial Vehicle Remote Sensing Images: A Review. ISPRS J. Photogramm. Remote Sens. 2024, 211, 1–34. [Google Scholar] [CrossRef]
  25. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings, Part III 18, Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  26. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected Crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  27. Yuan, B.; Zhao, D. A Survey on Continual Semantic Segmentation: Theory, Challenge, Method and Application. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10891–10910. [Google Scholar] [CrossRef] [PubMed]
  28. Lai, X.; Tian, Z.; Jiang, L.; Liu, S.; Zhao, H.; Wang, L.; Jia, J. Semi-Supervised Semantic Segmentation with Directional Context-Aware Consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1205–1214. [Google Scholar]
  29. Kwon, D.; Kwak, S. Semi-Supervised Semantic Segmentation with Error Localization Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9957–9967. [Google Scholar]
  30. Yang, L.; Qi, L.; Feng, L.; Zhang, W.; Shi, Y. Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7236–7246. [Google Scholar]
  31. Bhargavi, K.; Jyothi, S. A Survey on Threshold Based Segmentation Technique in Image Processing. Int. J. Innov. Res. Dev. 2014, 3, 234–239. [Google Scholar]
  32. Qiu, T.; Yan, Y.; Lu, G. An Autoadaptive Edge-Detection Algorithm for Flame and Fire Image Processing. IEEE Trans. Instrum. Meas. 2011, 61, 1486–1493. [Google Scholar] [CrossRef]
  33. Yu, Q.; Clausi, D.A. IRGS: Image Segmentation Using Edge Penalties and Region Growing. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 2126–2139. [Google Scholar] [CrossRef] [PubMed]
  34. Wang, Z.; Peng, T.; Lu, Z. Comparative Research on Forest Fire Image Segmentation Algorithms Based on Fully Convolutional Neural Networks. Forests 2022, 13, 1133. [Google Scholar] [CrossRef]
  35. Jonnalagadda, A.V.; Hashim, H.A. SegNet: A Segmented Deep Learning Based Convolutional Neural Network Approach for Drones Wildfire Detection. Remote Sens. Appl. 2024, 34, 101181. [Google Scholar] [CrossRef]
  36. Shahid, M.; Chen, S.-F.; Hsu, Y.-L.; Chen, Y.-Y.; Chen, Y.-L.; Hua, K.-L. Forest Fire Segmentation via Temporal Transformer from Aerial Images. Forests 2023, 14, 563. [Google Scholar] [CrossRef]
  37. Yang, S.; Huang, Q.; Yu, M. Advancements in Remote Sensing for Active Fire Detection: A Review of Datasets and Methods. Sci. Total. Environ. 2024, 943, 173273. [Google Scholar] [CrossRef] [PubMed]
  38. Zheng, S.; Gao, P.; Zou, X.; Wang, W. Forest Fire Monitoring via Uncrewed Aerial Vehicle Image Processing Based on a Modified Machine Learning Algorithm. Front. Plant. Sci. 2022, 13, 954757. [Google Scholar] [CrossRef] [PubMed]
  39. Thach, N.N.; Ngo, D.B.-T.; Xuan-Canh, P.; Hong-Thi, N.; Thi, B.H.; Nhat-Duc, H.; Dieu, T.B. Spatial Pattern Assessment of Tropical Forest Fire Danger at Thuan Chau Area (Vietnam) Using GIS-Based Advanced Machine Learning Algorithms: A Comparative Study. Ecol. Inf. 2018, 46, 74–85. [Google Scholar] [CrossRef]
  40. Moayedi, H.; Mehrabi, M.; Bui, D.T.; Pradhan, B.; Foong, L.K. Fuzzy-Metaheuristic Ensembles for Spatial Assessment of Forest Fire Susceptibility. J. Environ. Manag. 2020, 260, 109867. [Google Scholar] [CrossRef]
  41. Alkhatib, R.; Sahwan, W.; Alkhatieb, A.; Schütt, B. A Brief Review of Machine Learning Algorithms in Forest Fires Science. Appl. Sci. 2023, 13, 8275. [Google Scholar] [CrossRef]
  42. Ghali, R.; Akhloufi, M.A.; Jmal, M.; Souidene Mseddi, W.; Attia, R. Wildfire Segmentation Using Deep Vision Transformers. Remote Sens. 2021, 13, 3527. [Google Scholar] [CrossRef]
  43. Garcia, T.; Ribeiro, R.; Bernardino, A. Wildfire Aerial Thermal Image Segmentation Using Unsupervised Methods: A Multilayer Level Set Approach. Int. J. Wildland Fire 2023, 32, 435–447. [Google Scholar] [CrossRef]
  44. Lee, Y.J.; Jung, H.G.; Suhr, J.K. Semantic Segmentation Network Slimming and Edge Deployment for Real-Time Forest Fire or Flood Monitoring Systems Using Unmanned Aerial Vehicles. Electronics 2023, 12, 4795. [Google Scholar] [CrossRef]
  45. Feng, H.; Qiu, J.; Wen, L.; Zhang, J.; Yang, J.; Lyu, Z.; Liu, T.; Fang, K. U3UNet: An Accurate and Reliable Segmentation Model for Forest Fire Monitoring Based on UAV Vision. Neural Netw. 2025, 185, 107207. [Google Scholar] [CrossRef]
  46. Saxena, V.; Jain, Y.; Mittal, S. A Deep Learning Based Approach for Semantic Segmentation of Small Fires from UAV Imagery. Remote Sens. Lett. 2025, 16, 277–289. [Google Scholar] [CrossRef]
  47. Yuan, J.; Yang, M.; Wang, H.; Ding, X.; Li, S.; Gong, W. SAMFA: A Flame Segmentation Algorithm for Infrared and Visible Aerial Images in the Same Scene. Drones 2025, 9, 217. [Google Scholar] [CrossRef]
  48. Wang, Z.; Zheng, C.; Yin, J.; Tian, Y.; Cui, W. A Semantic Segmentation Method for Early Forest Fire Smoke Based on Concentration Weighting. Electronics 2021, 10, 2675. [Google Scholar] [CrossRef]
  49. Niu, K.; Wang, C.; Xu, J.; Yang, C.; Zhou, X.; Yang, X. An Improved YOLOv5s-Seg Detection and Segmentation Model for the Accurate Identification of Forest Fires Based on UAV Infrared Image. Remote Sens. 2023, 15, 4694. [Google Scholar] [CrossRef]
  50. Zou, Y.; Zhang, Z.; Zhang, H.; Li, C.-L.; Bian, X.; Huang, J.-B.; Pfister, T. Pseudoseg: Designing Pseudo Labels for Semantic Segmentation. arXiv 2020, arXiv:2010.09713. [Google Scholar]
  51. Mendel, R.; De Souza, L.A.; Rauber, D.; Papa, J.P.; Palm, C. Semi-Supervised Segmentation Based on Error-Correcting Supervision. In Proceedings, Part XXIX 16, Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 141–157. [Google Scholar]
  52. Yuan, J.; Ge, J.; Wang, Z.; Liu, Y. Semi-Supervised Semantic Segmentation with Mutual Knowledge Distillation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5436–5444. [Google Scholar]
  53. Fan, S.; Zhu, F.; Feng, Z.; Lv, Y.; Song, M.; Wang, F.-Y. Conservative-Progressive Collaborative Learning for Semi-Supervised Semantic Segmentation. IEEE Trans. Image Process. 2023, 32, 6183–6194. [Google Scholar] [CrossRef] [PubMed]
  54. Huang, H.; Xie, S.; Lin, L.; Tong, R.; Chen, Y.-W.; Li, Y.; Wang, H.; Huang, Y.; Zheng, Y. Semicvt: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11340–11349. [Google Scholar]
  55. Hu, X.; Jiang, L.; Schiele, B. Training Vision Transformers for Semi-Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4007–4017. [Google Scholar]
  56. Wang, H.; Zhang, Q.; Li, Y.; Li, X. Allspark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3627–3636. [Google Scholar]
Figure 1. Sample images from the Flame and D-Fire datasets.
Figure 2. Overall structure of the proposed semi-supervised semantic segmentation model.
Figure 3. Example of Flip data augmentation.
Figure 4. Example of CutMix data augmentation.
Figure 5. Structure of the Lagrange Interpolation Module.
Figure 6. The visual comparisons of segmentation results from different models on the Flame dataset.
Figure 7. The visual comparisons of segmentation results from different models on the D-Fire dataset.
Figure 8. Class activation maps after integrating different modules.
Table 1. The comparative experimental results of different methods on the Flame dataset.

Methods          Backbone      mIoU (%)
                               1/8     1/4     1/2     Full
ECS [51]         ResNet        72.4    74.6    79.6    87.6
PseudoSeg [50]   ResNet        71.9    73.2    77.7    86.3
DCC [28]         ResNet        73.5    75.4    80.3    88.5
ELN [29]         ResNet        72.6    73.3    79.4    87.4
MKD [52]         ResNet        70.8    73.5    79.9    88.1
CPCL [53]        ResNet        69.3    72.8    78.6    86.9
SemiCVT [54]     Transformer   74.6    77.6    82.4    90.4
S4Former [55]    Transformer   72.0    76.3    83.6    90.2
Allspark [56]    Transformer   70.9    75.8    82.7    90.0
ESL [22]         ResNet        70.2    73.1    80.4    87.8
UniMatch [30]    ResNet        72.7    73.8    81.5    89.3
Ours             ResNet        74.2    76.3    82.6    90.4
Ours             Transformer   75.4    78.5    84.5    91.6
Table 2. The comparative experimental results of different methods on the D-Fire dataset.

Methods          Backbone      mIoU (%)
                               1/8     1/4     1/2     Full
ECS [51]         ResNet        75.9    76.4    81.4    89.4
PseudoSeg [50]   ResNet        72.8    76.8    80.5    89.7
DCC [28]         ResNet        75.4    77.5    82.3    90.6
ELN [29]         ResNet        74.8    76.2    82.6    91.2
MKD [52]         ResNet        73.6    75.3    83.7    91.9
CPCL [53]        ResNet        72.4    74.8    80.1    90.2
SemiCVT [54]     Transformer   77.3    80.2    85.4    92.4
S4Former [55]    Transformer   76.6    80.4    85.8    92.7
Allspark [56]    Transformer   75.9    78.7    83.6    91.9
ESL [22]         ResNet        73.9    75.9    82.4    90.3
UniMatch [30]    ResNet        75.4    76.8    84.3    91.5
Ours             ResNet        77.6    79.5    85.8    92.8
Ours             Transformer   79.8    82.5    86.7    93.7
Table 3. Ablation experiment results on Flame and D-Fire datasets.

Labeling Ratio   Backbone   LIM   PCC   mIoU (%) Flame   mIoU (%) D-Fire
1/8              ResNet     –     –     70.4             72.8
1/8              ResNet     ✓     –     72.2             75.7
1/8              ResNet     –     ✓     73.5             76.1
1/8              ResNet     ✓     ✓     74.2             77.6
1/4              ResNet     –     –     71.8             73.9
1/4              ResNet     ✓     –     73.6             77.7
1/4              ResNet     –     ✓     74.8             78.4
1/4              ResNet     ✓     ✓     76.3             79.5
1/2              ResNet     –     –     77.6             79.4
1/2              ResNet     ✓     –     79.9             83.5
1/2              ResNet     –     ✓     80.4             83.7
1/2              ResNet     ✓     ✓     82.6             85.8
Table 4. Performance comparison of different structural enhancement modules.

Method                 mIoU (%) Flame   mIoU (%) D-Fire
Baseline               77.6             79.4
Baseline + CBAM        79.4             83.4
Baseline + SE-Block    78.5             82.9
Baseline + ECA-Block   78.6             81.1
Baseline + LIM         79.9             83.5
Table 5. Comparison of parameter count, segmentation accuracy, and training time.

Method           Backbone      mIoU (%)   Params (M)   Training Time (s)
ECS [51]         ResNet        79.6       38.8         341
PseudoSeg [50]   ResNet        77.7       37.4         334
DCC [28]         ResNet        80.3       40.3         354
ELN [29]         ResNet        79.4       41.9         369
MKD [52]         ResNet        79.9       39.5         376
CPCL [53]        ResNet        78.6       42.2         392
SemiCVT [54]     Transformer   82.4       53.5         513
S4Former [55]    Transformer   83.6       55.8         502
Allspark [56]    Transformer   82.7       54.0         493
ESL [22]         ResNet        80.4       39.8         363
UniMatch [30]    ResNet        81.5       45.5         409
Ours             ResNet        82.6       37.9         338
Ours             Transformer   84.5       53.7         485
Table 6. The mIoU (%) under different input sizes and labeling ratios.

Input Size   mIoU (%)
             1/8     1/4     1/2     Full
256 × 256    74.2    76.3    82.6    90.4
384 × 384    74.6    76.9    82.9    90.7
512 × 512    74.8    77.2    83.0    90.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
