Deep Learning-Based Fire Hotspot Detection Using HY-1E COCTS2 Data in the Three-North Region of China

Zhou, Yangyang; Zhu, Haitian; Song, Yan; Huang, Lei; Cui, Limin; Zhang, Weiliang; Fang, Yinghui

doi:10.3390/su18115512

by Yangyang Zhou ¹,², Haitian Zhu ¹,²,*, Yan Song ³, Lei Huang ¹, Limin Cui ¹, Weiliang Zhang ¹ and Yinghui Fang ¹

¹ National Satellite Ocean Application Service, Beijing 100081, China

² State Key Laboratory of Satellite Ocean Environment Dynamics, National Satellite Ocean Application Service, Beijing 100081, China

³ School of Geography and Information Engineering, China University of Geosciences, Wuhan 430074, China

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(11), 5512; https://doi.org/10.3390/su18115512

Submission received: 17 April 2026 / Revised: 15 May 2026 / Accepted: 19 May 2026 / Published: 1 June 2026

(This article belongs to the Special Issue Research on Ecological and Environmental Sustainability Based on Remote Sensing and Geographic Information Systems)

Abstract

Accurate and timely wildfire hotspot detection is essential for ecological sustainability and for supporting climate resilience strategies. Although sensors such as MODIS and VIIRS have been widely used for wildfire detection, the potential of ocean color satellites for terrestrial wildfire monitoring remains largely unexplored. In this study, a Spectral–Spatial Attention U-Net (SSA-UNet) framework is proposed for wildfire hotspot detection using multispectral observations from the second-generation Chinese Ocean Color and Temperature Scanner (COCTS2) aboard HY-1E over the Three-North region of China. The proposed framework integrates spectral attention to enhance fire-sensitive bands and spatial attention to capture contextual wildfire patterns under complex environmental conditions. Experimental results show that SSA-UNet achieves a Precision of 0.8913, Recall of 0.7961, and F1-score of 0.8680, outperforming conventional threshold-based approaches and baseline deep learning models. Ablation experiments further demonstrate the effectiveness of the spectral–spatial attention mechanism, while band analysis highlights the important contributions of near-infrared, shortwave infrared, and thermal infrared observations for wildfire hotspot detection. A real wildfire case analysis further confirms the practical applicability of the proposed framework. The results demonstrate that HY-1E COCTS2 data have considerable potential for large-scale terrestrial wildfire monitoring when combined with deep learning techniques.

Keywords:

HY-1E; COCTS2; fire hotspot detection; deep learning; satellite remote sensing

1. Introduction

Wildfires are among the most destructive natural hazards, causing severe ecological degradation, atmospheric pollution, biodiversity loss, and substantial socio-economic damage [1,2]. In recent decades, global wildfire activity has exhibited a clear trend toward increasing frequency, intensity, and spatial extent under the combined influence of climate change and extreme weather events [3]. This escalating threat poses a direct challenge to achieving long-term environmental sustainability and ecological security goals worldwide. Accurate and timely fire hotspot detection has therefore become more than a technical issue; it is a critical requirement for supporting sustainable development through ecosystem conservation, climate adaptation, disaster mitigation, and community resilience enhancement.

Previous studies have demonstrated that prolonged droughts, persistent heatwaves, low relative humidity, and strong wind conditions significantly increase vegetation flammability and accelerate fire spread [4,5]. Bowman et al. [2] emphasized that climate-driven changes in temperature and precipitation regimes are fundamentally reshaping global fire dynamics, while Abatzoglou and Williams [6] reported that anthropogenic warming has substantially intensified wildfire occurrence by increasing fuel aridity. Particularly in arid and semi-arid regions, elevated surface temperatures and reduced soil moisture further enhance ignition probability and fire propagation [7]. Effectively monitoring and managing wildfire risks in these vulnerable regions is integral to sustainable land management and climate adaptation strategies. Furthermore, a meso-scale risk analysis in Lovcen National Park, Montenegro, demonstrated the utility of medium-resolution satellite data in quantifying wildfire vulnerability by integrating climatic, vegetative, and topographic factors [8]. These studies collectively highlight the importance of accurate and timely fire hotspot detection for wildfire early warning, emergency response, and post-fire assessment within sustainable disaster risk reduction frameworks.

Satellite remote sensing has become the primary approach for large-scale and continuous wildfire monitoring because of its wide spatial coverage and frequent revisit capability [9]. From a physical perspective, fire hotspot detection is fundamentally based on the enhanced radiative emission of high-temperature targets, especially in the shortwave infrared (SWIR) and thermal infrared (TIR) spectral regions. The classical dual-band sub-pixel fire model proposed by Dozier [10] established the theoretical foundation for estimating fire temperature and sub-pixel burning area from satellite observations. Building upon this framework, contextual thermal anomaly algorithms were subsequently developed and widely implemented in operational fire products derived from MODIS and VIIRS sensors [11,12,13,14,15]. Among them, the NASA Fire Information for Resource Management System (FIRMS) provides globally consistent active fire products with long-term operational stability and has become one of the most widely used fire monitoring databases worldwide [16]. The continuous advancement of satellite-based fire monitoring technologies plays an increasingly important role in supporting sustainable forest management, ecological protection, and climate governance.

Although MODIS- and VIIRS-based fire products have demonstrated considerable success, conventional threshold and contextual algorithms still face several limitations. Under moderate spatial resolution, fire pixels are often mixed with surrounding background components, leading to reduced separability between fire signals and non-fire surfaces [17]. This issue becomes particularly severe in heterogeneous environments such as deserts, bare soil regions, urban surfaces, and cloud-contaminated areas, where spectrally similar high-temperature backgrounds can substantially increase false alarms [18]. Consequently, wildfire detection is not merely a thermal anomaly extraction problem but also a complex spectral–spatial discrimination task under varying environmental conditions.

To improve fire detection capability, recent studies have increasingly explored multispectral remote sensing approaches by jointly utilizing visible, near-infrared (NIR), shortwave infrared (SWIR), and thermal infrared (TIR) observations [19,20]. Multispectral observations can simultaneously characterize vegetation structure, surface moisture conditions, and thermal anomalies, thereby providing more comprehensive information for fire-background discrimination [21]. In parallel, machine learning techniques such as Random Forest (RF) and Support Vector Machine (SVM) have been introduced to construct multidimensional feature spaces for fire detection [22]. More recently, deep learning methods have demonstrated superior capability in modeling complex nonlinear spectral–spatial relationships within remote sensing imagery [18,23]. Convolutional neural networks (CNNs), fully convolutional networks (FCNs), and transformer-based architectures have shown promising performance in wildfire mapping and burned-area segmentation tasks [24,25,26]. Encoder–decoder frameworks such as U-Net [27] are particularly effective for pixel-level prediction because of their ability to integrate multi-scale semantic and spatial features. These advances provide a pathway toward more intelligent and resilient wildfire monitoring capabilities that can support sustainability goals through data-driven approaches.

To further enhance feature representation, attention mechanisms have been widely incorporated into deep learning models. Channel attention and spatial attention modules enable adaptive emphasis on informative spectral bands and spatial regions, thereby improving the discrimination of target features under complex backgrounds [28,29,30]. The Convolutional Block Attention Module (CBAM) [28] and dual attention mechanisms [29] have achieved strong performance in remote sensing scene analysis and semantic segmentation tasks.

The Chinese HY-1E satellite provides a new opportunity for multispectral wildfire monitoring. HY-1E carries the second-generation Chinese Ocean Color and Temperature Scanner (COCTS2), which provides multispectral observations spanning visible, NIR, SWIR, and thermal infrared wavelengths [31]. A comparison of the principal characteristics of commonly used fire-monitoring datasets is presented in Table 1. Although originally designed for marine observation, COCTS2 has a spatial resolution suitable for terrestrial thermal anomaly detection. Compared with traditional ocean color sensors, COCTS2 exhibits enhanced radiometric sensitivity and improved thermal infrared capability, making it potentially valuable for wildfire hotspot monitoring. Preliminary studies have already demonstrated the feasibility of HY-1E observations for detecting large wildfire events [32]. Exploring and validating new data sources such as HY-1E COCTS2 for terrestrial applications aligns with the principle of promoting innovation and infrastructure for sustainability, and expands the resources available for global environmental stewardship.

Despite recent advances in multispectral fire detection, several important challenges remain unresolved. First, most traditional threshold-based methods rely on handcrafted rules and fixed empirical parameters, limiting their robustness under heterogeneous environmental conditions. Second, the application potential of ocean color satellites for terrestrial wildfire monitoring remains largely unexplored, particularly for HY-1E COCTS2.

To address these limitations, this study proposes a deep learning-based fire hotspot detection framework using HY-1E COCTS2 multispectral observations over the Three-North region of China. Specifically, a Spectral–Spatial Attention U-Net (SSA-UNet) is developed to jointly model spectral dependencies and spatial contextual information for robust fire hotspot extraction under complex background conditions. The proposed framework aims to improve the reliability and operational applicability of wildfire monitoring in ecologically sensitive regions. Enhanced wildfire detection capability can support the protection of critical ecosystems such as the Three-North Shelterbelt Project, which plays a vital role in combating desertification, preserving ecological stability, and promoting sustainable development in northern China.

2. Study Area and Data Sources

2.1. Study Area

The Three-North region (i.e., the Three-North Shelterbelt Program zone) represents a critical ecological security barrier in China and is also a high-incidence area for forest fires. The region covers Northwest, North, and Northeast China, spanning 73°26′–127°50′ E and 33°30′–50°12′ N, with a total area of approximately 4.07 million km². It covers extensive arid and semi-arid zones in northern China.

The study area is characterized by a continental monsoon climate, with pronounced interannual and seasonal variability in precipitation. Dry conditions and strong winds during spring and autumn, combined with fuel accumulation and heterogeneous vegetation structure, facilitate the occurrence of forest fire events with varying scales and intensities. In addition, forest distribution in the Three-North region is relatively fragmented, and fire hotspots are typically small-scale and spatially localized. Under moderate spatial resolution, mixed pixel effects are significant, posing challenges for traditional threshold-based fire hotspot detection methods. Therefore, the study area provides a representative testbed with complex background conditions and diverse fire regimes for evaluating multispectral fire hotspot detection approaches.

In this study, the spatial extent of the Three-North Shelterbelt Program is adopted as the study area for regional-scale fire hotspot detection and validation (Figure 1).

2.2. Data Sources

This study integrates multispectral remote sensing data and auxiliary datasets for fire hotspot detection and validation, including Three-North boundary data, MODIS fire hotspot products (FIRMS), and HY-1E satellite observations.

2.2.1. Three-North Boundary Data

The boundary dataset of the Three-North Shelterbelt Program is used to define the study area and provide spatial constraints for fire hotspot analysis. The dataset is obtained from the National Ecosystem Science Data Center (http://www.nesdc.org.cn/) and is provided in vector format with a complete topological structure and clear administrative attributes.

During preprocessing, coordinate system transformation and geometric correction are performed to ensure spatial consistency with remote sensing data.

2.2.2. MODIS Fire Hotspot Data (FIRMS)

Fire reference data were obtained from the MODIS Collection 6 Near Real-Time (NRT) Active Fire Product provided by the Fire Information for Resource Management System (FIRMS) developed by NASA [15]. These products are generated based on well-established thermal infrared (TIR) thresholding and contextual background comparison algorithms, with long-term operational stability and global consistency.

To ensure temporal consistency, MODIS fire hotspot data matching the overpass time of the HY-1E satellite are selected for spatial comparison and accuracy evaluation. Because of its moderate spatial resolution, MODIS is better suited to medium- to large-scale fire monitoring, and uncertainties remain in its representation of small-scale fire hotspots.

FIRMS data are therefore used as a reference dataset for performance evaluation rather than as absolute ground truth. Owing to spatial resolution limitations and algorithm uncertainties, FIRMS products may omit small or low-intensity fires and introduce positional uncertainties, so evaluation results should be interpreted as relative agreement rather than absolute accuracy.

2.2.3. HY-1E Satellite Data

The HY-1E satellite, launched in November 2023, carries the second-generation Chinese Ocean Color and Temperature Scanner (COCTS2) as its primary payload. HY-1E provides observations with a revisit frequency of approximately twice per day. COCTS2 provides multispectral observations with a nadir ground sampling distance (GSD) of 500 m and a swath width exceeding 3000 km, enabling large-scale Earth observation with high temporal coverage.

The HY-1E COCTS2 Level-1C (L1C) data used in this study were obtained from the China Marine Satellite Data Distribution System (https://osdds.nsoas.org.cn/). The L1C products have undergone radiometric calibration and geometric correction, providing top-of-atmosphere radiance data with reliable radiometric consistency and geolocation accuracy. These preprocessing procedures ensure suitability for multispectral joint analysis and deep learning-based fire hotspot detection.

The COCTS2 sensor includes visible, near-infrared (NIR), shortwave infrared (SWIR), mid-wave thermal infrared (MWIR), and long-wave thermal infrared (LWIR) channels, allowing simultaneous observation of vegetation conditions, thermal anomalies, and background surface temperature. In this study, the term “thermal infrared (TIR)” collectively refers to the MWIR (3.74 μm) and LWIR (10.8–12 μm) channels used for thermal anomaly detection.

Among the available spectral channels, the SWIR and thermal infrared bands are particularly important for wildfire hotspot detection. The MWIR band (~3.74 μm) is highly sensitive to high-temperature thermal radiation emitted by active fires and is effective for distinguishing fire pixels from surrounding background surfaces. The LWIR bands (~10.8 μm and ~12.0 μm) provide complementary information regarding background temperature and thermal contrast, which helps suppress false alarms caused by sunlit surfaces and heterogeneous land-cover conditions. The combination of MWIR and LWIR information therefore provides improved robustness for wildfire hotspot identification under complex environmental backgrounds.

Compared with conventional wildfire monitoring sensors, COCTS2 exhibits spectral characteristics generally comparable to MODIS and VIIRS in fire-sensitive bands, while differences remain in bandwidth configuration, radiometric response, and sensor design objectives. Detailed information regarding the spectral characteristics and primary applications of HY-1E COCTS2 bands is summarized in Table 2.

For this study, twelve COCTS2 images were selected, as shown in Figure 2. Nine images, acquired during the fire-prone seasons of 2025 (March–May and September–November), were used for model training and validation. An additional three images, acquired in March 2026, were reserved as a completely independent test set to evaluate the model’s generalization capability to unseen temporal data. All selected images have a cloud cover of less than 20% and encompass diverse vegetation types and terrain conditions across the study area. To ensure consistency, all datasets underwent uniform radiometric normalization and were reprojected to a common coordinate system for spatial alignment.

3. Method

This study develops a Spectral–Spatial Attention U-Net (SSA-UNet) by fully leveraging the multispectral observation capability of HY-1E COCTS2 data and integrating adaptive spectral weighting and spatial context modeling into the classic U-Net architecture. As shown in Figure 3, the proposed framework consists of four major stages: data preprocessing, sample construction with label generation, model training based on the attention-enhanced architecture, and fire hotspot extraction combined with post-processing.

First, the original L1C multispectral imagery from COCTS2 is preprocessed, including band selection, radiometric normalization, and spatial reference unification, to ensure spectral consistency and spatial alignment across multi-temporal images. Next, using FIRMS fire point data spatially and temporally matched to HY-1E imagery, binary label masks are generated for model training, and a sliding-window strategy is adopted to construct training samples. Subsequently, an SSA-UNet model is trained to perform pixel-level fire detection by jointly learning spectral and spatial features. Finally, the trained model is applied to full-scene imagery using a sliding-window inference approach, followed by post-processing to extract fire hotspot regions and their spatial attributes.

3.1. Data Preprocessing

To ensure the consistency and usability of multispectral remote sensing data, systematic preprocessing is applied to the Level-1C imagery from the HY-1E satellite COCTS2.

(1): Band selection

Based on the spectral configuration of COCTS2 and the radiative characteristics of high-temperature targets, five bands are selected to construct the multispectral input dataset. These bands cover the near-infrared (NIR) and shortwave infrared (SWIR) regions, which are sensitive to high-temperature radiation and surface moisture variation.

(2): Radiometric normalization

To reduce radiometric discrepancies among different HY-1E COCTS2 scenes and improve the numerical stability of deep learning training, all spectral bands were normalized using standard score normalization (z-score normalization):

X_{norm} = \frac{X - \mu}{\sigma + \epsilon}

(1)

where X denotes the original radiance, μ and σ represent the mean and standard deviation of the image, respectively, and ϵ = 1 × 10⁻⁶ is a small constant to avoid division by zero.

The z-score normalization method was selected instead of other commonly used approaches such as min–max scaling because the multispectral bands of HY-1E COCTS2 exhibit substantially different radiometric ranges and statistical distributions. Standardization to zero mean and unit variance reduces inter-band scale discrepancies and prevents high-value thermal bands from dominating the optimization process during network training.

In addition, wildfire hotspot detection typically involves strong thermal anomalies and highly heterogeneous background conditions, including deserts, bare soil, smoke, and cloud-contaminated regions. Compared with min–max normalization, z-score normalization is less sensitive to extreme radiance values caused by saturated fire pixels or cloud edges, thereby preserving the statistical structure of the majority of background pixels.

The standardized inputs also improve convergence stability and facilitate more effective feature learning within the proposed spectral–spatial attention framework of SSA-UNet.
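The normalization in Equation (1) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: it assumes that μ and σ are computed separately for each band over the whole scene, and the synthetic two-band array stands in for real COCTS2 radiances.

```python
import numpy as np

EPS = 1e-6  # small constant from Equation (1)


def zscore_normalize(image: np.ndarray, eps: float = EPS) -> np.ndarray:
    """Standardize each band of a (bands, height, width) array.

    Assumption: mean and standard deviation are taken per band over the scene.
    """
    image = image.astype(np.float64)
    mu = image.mean(axis=(1, 2), keepdims=True)
    sigma = image.std(axis=(1, 2), keepdims=True)
    return (image - mu) / (sigma + eps)


rng = np.random.default_rng(0)
# Two synthetic bands with very different radiometric ranges.
scene = np.stack([
    rng.normal(300.0, 20.0, size=(64, 64)),
    rng.normal(0.5, 0.1, size=(64, 64)),
])
normalized = zscore_normalize(scene)
print(normalized.mean(axis=(1, 2)), normalized.std(axis=(1, 2)))
```

After standardization both bands have approximately zero mean and unit variance, so neither dominates the gradient signal purely because of its physical units.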

(3): Spatial reference unification

To ensure spatial consistency across multi-temporal imagery, all datasets are reprojected into the WGS84 UTM coordinate system.

3.2. Label Generation and Sample Construction

The NASA FIRMS fire point product is used as the reference for generating training labels and conducting evaluations. For each HY-1E COCTS2 image, all MODIS and VIIRS fire detections within a ±6 h temporal window relative to the satellite overpass are selected to ensure temporal consistency while accounting for potential fire dynamics.
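The ±6 h temporal matching can be illustrated with a short filter. The record layout (`id`, `time`), the overpass time, and the detection times are hypothetical, and the sketch assumes that all timestamps are in the same time zone and that the window boundary is inclusive.

```python
from datetime import datetime, timedelta


def within_window(detections, overpass, hours=6.0):
    """Keep detections whose acquisition time lies within +/- `hours` of the overpass."""
    window = timedelta(hours=hours)
    return [d for d in detections if abs(d["time"] - overpass) <= window]


# Hypothetical overpass and detection records, for illustration only.
overpass = datetime(2025, 4, 12, 5, 30)
detections = [
    {"id": "a", "time": datetime(2025, 4, 12, 2, 10)},   # 3 h 20 min before
    {"id": "b", "time": datetime(2025, 4, 12, 11, 45)},  # 6 h 15 min after
    {"id": "c", "time": datetime(2025, 4, 11, 23, 30)},  # exactly 6 h before
]
matched = within_window(detections, overpass)
print([d["id"] for d in matched])
```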

Since FIRMS provides point-based detections rather than pixel-level fire perimeters, it cannot be directly used as segmentation ground truth. Therefore, a semi-manual annotation strategy is adopted to generate training labels. Specifically, FIRMS fire points are first used as spatial anchors to locate candidate fire regions. Then, fire boundaries are delineated through visual interpretation of multispectral imagery, guided by the following strict criteria:

(1): Significant radiance enhancement in SWIR and TIR bands;
(2): Spatial continuity and morphological consistency of high-temperature regions;
(3): Temporal agreement with FIRMS detections;
(4): Exclusion of ambiguous regions (e.g., cloud edges, bright bare soil, or industrial heat sources).

This strategy ensures that the generated fire polygons represent physically meaningful fire-affected areas while minimizing subjectivity and labeling noise. The resulting vector polygons are rasterized to produce binary masks for supervised learning. The workflow of training label generation and sample construction is summarized in Figure 4, which demonstrates how FIRMS fire points are used as spatial–temporal anchors and further refined through multispectral visual interpretation and strict boundary delineation criteria.

To construct the training dataset, a sliding-window sampling strategy is applied. Multispectral image patches of size 128 × 128 pixels are extracted with a stride of 64 pixels, resulting in 50% overlap between adjacent patches. This configuration preserves spatial context and increases sample diversity. Each image patch is paired with its corresponding binary mask.
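The sliding-window sampling can be sketched as follows. The function names are illustrative, and dropping windows that do not fit entirely inside the scene is an assumption, since the handling of image edges is not described.

```python
import numpy as np


def window_starts(length, patch=128, stride=64):
    """Offsets along one axis of all windows that fit entirely inside the image."""
    if length < patch:
        return []
    return list(range(0, length - patch + 1, stride))


def extract_patches(image, mask, patch=128, stride=64):
    """Yield (image_patch, mask_patch) pairs from a (bands, H, W) scene and an (H, W) mask."""
    _, height, width = image.shape
    for r in window_starts(height, patch, stride):
        for c in window_starts(width, patch, stride):
            yield image[:, r:r + patch, c:c + patch], mask[r:r + patch, c:c + patch]


image = np.zeros((5, 512, 384), dtype=np.float32)
mask = np.zeros((512, 384), dtype=np.uint8)
pairs = list(extract_patches(image, mask))
print(len(pairs))  # 7 row offsets x 5 column offsets = 35
```

With a stride equal to half the patch size, every interior pixel is covered by up to four patches, which is what gives the 50% overlap between neighbours.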

Due to the extreme class imbalance inherent in fire detection tasks, where fire pixels typically account for less than 0.01% of the total area, negative samples are randomly down-sampled to maintain a controlled positive-to-negative ratio of approximately 1:2. In total, 3240 samples are generated, including 1080 fire-containing patches and 2160 non-fire patches.
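A simple way to obtain the 1:2 ratio is to keep every fire-containing patch and draw a random subset of the non-fire patches. The sketch below assumes that a patch counts as positive when its mask contains at least one fire pixel; the helper name and the random seed are illustrative.

```python
import numpy as np


def balance_patches(masks, neg_per_pos=2, seed=0):
    """Return sorted indices of all fire patches plus a random subset of non-fire patches."""
    has_fire = np.array([m.any() for m in masks], dtype=bool)
    pos = np.flatnonzero(has_fire)
    neg = np.flatnonzero(~has_fire)
    rng = np.random.default_rng(seed)
    n_keep = min(len(neg), neg_per_pos * len(pos))
    kept_neg = rng.choice(neg, size=n_keep, replace=False)
    return np.sort(np.concatenate([pos, kept_neg]))


# Toy example: 10 fire patches and 100 non-fire patches.
masks = [np.zeros((8, 8), dtype=np.uint8) for _ in range(110)]
for m in masks[:10]:
    m[3, 3] = 1
selected = balance_patches(masks)
print(len(selected))  # 10 positives + 20 negatives = 30
```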

The dataset consists of 12 HY-1E COCTS2 images. Among them, 9 images acquired in 2025 are used for training and validation (split ratio 8:2), while 3 images acquired in March 2026 are reserved as a completely independent test set. The test dataset is strictly separated in both temporal and spatial domains, ensuring no overlap with training data and providing a robust assessment of model generalization capability. Detailed information on the selected HY-1E COCTS2 images and corresponding FIRMS fire point counts is summarized in Table 3.

To improve model generalization and robustness under limited training data, data augmentation techniques were applied during training. Specifically, each training patch was randomly augmented using the following transformations:

(1): Random horizontal and vertical flipping;
(2): Random rotation (0°, 90°, 180°, 270°);
(3): Random brightness and contrast adjustment (±10%);
(4): Gaussian noise perturbation with low variance.

These augmentations were applied on-the-fly during training to increase sample diversity and reduce overfitting, particularly under severe class imbalance conditions.
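The four transformations listed above can be combined into a single on-the-fly function. This is a sketch under stated assumptions: brightness is modelled as a multiplicative gain, contrast as a rescaling about each band's patch mean, and the noise standard deviation of 0.01 is a placeholder, since none of these details are specified in the text.

```python
import numpy as np


def augment(patch, mask, rng, jitter=0.10, noise_std=0.01):
    """Randomly augment a (bands, H, W) patch and its (H, W) mask."""
    if rng.random() < 0.5:  # horizontal flip
        patch, mask = patch[:, :, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:  # vertical flip
        patch, mask = patch[:, ::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))  # rotation by k * 90 degrees
    patch = np.rot90(patch, k, axes=(1, 2))
    mask = np.rot90(mask, k, axes=(0, 1))
    # Radiometric changes apply to the image only, never to the label mask.
    gain = 1.0 + rng.uniform(-jitter, jitter)      # brightness (assumed form)
    contrast = 1.0 + rng.uniform(-jitter, jitter)  # contrast (assumed form)
    mean = patch.mean(axis=(1, 2), keepdims=True)
    patch = ((patch - mean) * contrast + mean) * gain
    patch = patch + rng.normal(0.0, noise_std, size=patch.shape)
    return np.ascontiguousarray(patch), np.ascontiguousarray(mask)


rng = np.random.default_rng(42)
patch = rng.normal(size=(5, 128, 128))
mask = np.zeros((128, 128), dtype=np.uint8)
mask[10:13, 20:22] = 1
aug_patch, aug_mask = augment(patch, mask, rng)
print(aug_patch.shape, int(aug_mask.sum()))
```

The geometric transformations are applied identically to the image and the mask, so the number of labelled fire pixels is unchanged.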

3.3. Fire Hotspot Detection Model: SSA-UNet

To improve feature representation for fire detection, a spectral–spatial attention-enhanced U-Net (SSA-UNet) is proposed.

3.3.1. Network Architecture

The network takes multispectral image patches (5 bands, 128 × 128 pixels) as input and outputs a probability map with the same spatial resolution, where each pixel represents the likelihood of a fire hotspot. As shown in Figure 5, the proposed network adopts a typical encoder–decoder architecture composed of three major components: an encoder, a bottleneck layer, and a decoder.

The encoder (downsampling path) contains four convolutional blocks. Each block consists of two 3 × 3 convolutional layers, followed by Batch Normalization and ReLU activation, and a 2 × 2 max-pooling layer. The number of feature channels increases progressively from 32 to 256, while the spatial resolution is reduced by a factor of 2 at each stage. This structure enables hierarchical extraction of multi-scale semantic features.

The bottleneck layer contains two 3 × 3 convolutional layers with 512 channels, capturing high-level semantic representations of fire hotspots.

The decoder (upsampling path) is symmetric to the encoder and consists of four upsampling blocks. Each block includes a 2 × 2 transposed convolution for upsampling, followed by concatenation with the corresponding encoder features via skip connections, and two 3 × 3 convolutional layers with Batch Normalization and ReLU activation. The number of channels is gradually reduced from 256 to 32, enabling precise spatial reconstruction.

Output layer. A 1 × 1 convolution is applied to map the feature representation to a single-channel probability map, followed by a Sigmoid activation function:

P(x,y) ∈ [0, 1]

(2)

where P denotes the predicted probability of a fire hotspot at pixel (x,y).
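To make the feature-map sizes concrete, the following sketch traces channels and spatial size through the encoder, bottleneck, and decoder for a 128 × 128 input. It assumes that the channel count doubles at each encoder stage (32, 64, 128, 256), which is consistent with the stated range but not spelled out in the text.

```python
def unet_shapes(size=128, base=32, depth=4, in_bands=5):
    """Return (stage, channels, side length) tuples for the described encoder-decoder."""
    rows = [("input", in_bands, size)]
    channels, side = base, size
    for i in range(depth):
        rows.append((f"encoder{i + 1}", channels, side))  # features reused by the skip connection
        side //= 2      # 2 x 2 max pooling halves the resolution
        channels *= 2   # assumed doubling of feature channels
    rows.append(("bottleneck", channels, side))
    for i in range(depth, 0, -1):
        channels //= 2
        side *= 2       # 2 x 2 transposed convolution doubles the resolution
        rows.append((f"decoder{i}", channels, side))
    rows.append(("output", 1, side))
    return rows


for stage, channels, side in unet_shapes():
    print(f"{stage:<11}{channels:>4} channels  {side} x {side}")
```

Under these assumptions the bottleneck operates on 512 channels at 8 × 8 pixels, and the output probability map returns to 128 × 128.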

3.3.2. Spectral–Spatial Attention Mechanism

To enhance the discriminative capability of the model, a spectral–spatial attention (SSA) module is introduced.

(1): Spectral Attention (Channel Attention)

In the spectral attention module, channel-wise attention is implemented using a lightweight multilayer perceptron (MLP). Specifically, both global average pooling and global max pooling are applied to extract complementary channel-wise descriptors. The average pooling operation captures the global distribution of feature responses, while max pooling emphasizes the most salient activations, enabling the aggregation of diverse statistical characteristics.

These descriptors are then passed through a shared two-layer fully connected network with a reduction ratio of r = 8, where the hidden layer dimension is C/r and C denotes the number of input channels. A ReLU activation function is applied between the two layers, followed by a sigmoid function to generate normalized channel attention weights.

M_{c}(X) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(X)) + \mathrm{MLP}(\mathrm{MaxPool}(X)))

(3)

where Mc denotes the channel attention weights and σ is the sigmoid function.
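A forward pass of the spectral attention described above can be written in plain NumPy. This is a sketch, not the trained module: the weight matrices `w1` and `w2` are random stand-ins for learned parameters, biases are omitted, and the sigmoid is applied to the sum of the two shared-MLP outputs, following the description in the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def channel_attention(x, w1, w2):
    """x: (C, H, W); w1: (C // r, C); w2: (C, C // r). Returns weights of shape (C,)."""
    avg = x.mean(axis=(1, 2))  # global average pooling
    mx = x.max(axis=(1, 2))    # global max pooling

    def mlp(v):
        # Shared two-layer MLP with ReLU between the layers.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))


rng = np.random.default_rng(0)
C, r = 32, 8  # reduction ratio r = 8, hidden width C / r = 4
x = rng.normal(size=(C, 16, 16))
w1 = rng.normal(scale=0.1, size=(C // r, C))
w2 = rng.normal(scale=0.1, size=(C, C // r))
weights = channel_attention(x, w1, w2)
refined = weights[:, None, None] * x  # re-weight each channel
print(weights.shape, refined.shape)
```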

(2): Spatial Attention

In the spatial attention module, the operation f^(7×7) is implemented as a convolutional layer with a kernel size of 7 × 7, followed by Batch Normalization and a Sigmoid activation function. The input feature maps are first aggregated using channel-wise average pooling and max pooling, and then passed through the convolution layer to produce spatial attention maps.

Spatial attention focuses on the spatial distribution of fire regions:

M_{s}(X) = \sigma(f^{7 \times 7}([\mathrm{Avg}(X); \mathrm{Max}(X)]))

(4)

where Ms highlights fire-related spatial regions.

(3): SSA Fusion

The final SSA module combines both mechanisms:

X^{'} = M_{s} (M_{c} (X) \times X)

(5)

This design enables the model to simultaneously emphasize fire-sensitive spectral bands and capture spatial context of fire distribution.
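The spatial attention map and the fusion step can be sketched in NumPy as well. Several points are assumptions: the 7 × 7 filter is a random stand-in, Batch Normalization is omitted, zero padding keeps the map the same size as the input, and the fusion follows the common CBAM-style reading in which the spatial map multiplies the channel-refined features.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(x, kernel):
    """x: (C, H, W); kernel: (2, 7, 7). Returns an (H, W) map with values in (0, 1)."""
    stacked = np.stack([x.mean(axis=0), x.max(axis=0)])  # channel-wise avg and max, (2, H, W)
    pad = kernel.shape[-1] // 2
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    windows = sliding_window_view(padded, kernel.shape[1:], axis=(1, 2))  # (2, H, W, 7, 7)
    response = np.einsum("chwij,cij->hw", windows, kernel)  # cross-correlation
    return sigmoid(response)


def ssa_fusion(x, channel_weights, kernel):
    """Apply channel weights first, then the spatial map (assumed CBAM-style order)."""
    x_c = channel_weights[:, None, None] * x
    return spatial_attention(x_c, kernel)[None, :, :] * x_c


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16, 16))
channel_weights = sigmoid(rng.normal(size=32))   # stand-in for M_c(X)
kernel = rng.normal(scale=0.05, size=(2, 7, 7))  # stand-in for the learned filter
refined = ssa_fusion(x, channel_weights, kernel)
print(refined.shape)
```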

(4): Integration into U-Net

The SSA modules are embedded into convolutional blocks in both encoder and decoder stages, allowing multi-scale attention learning.

Additionally, an input-level channel attention module is applied to refine raw spectral features.

3.3.3. Loss Function and Optimization

Due to the extremely low proportion of fire hotspot pixels (approximately 0.01%), the task suffers from severe class imbalance. To address this issue, a hybrid loss function combining Binary Cross-Entropy (BCE) loss and Dice loss is adopted:

L_{total} = L_{BCE} + L_{Dice}

(6)

The BCE loss is defined as:

L_{BCE} = - \frac{1}{N} \sum_{i=1}^{N} [y_{i} \log p_{i} + (1 - y_{i}) \log (1 - p_{i})]

(7)

The Dice loss is defined as:

L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} p_{i} y_{i} + \epsilon}{\sum_{i=1}^{N} p_{i} + \sum_{i=1}^{N} y_{i} + \epsilon}

(8)

where yi and pi denote the ground truth and predicted probability of pixel i, respectively, and ϵ is a small constant to avoid numerical instability. This combined loss function simultaneously optimizes pixel-wise classification accuracy and region-level overlap, effectively mitigating class imbalance.
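Equations (6)–(8) translate directly into NumPy. The values of ϵ and the clipping constant used to keep the logarithm finite are assumptions, as is the toy example, which mimics a patch in which 0.01% of the pixels are fire.

```python
import numpy as np


def bce_loss(p, y, clip=1e-7):
    """Binary cross-entropy, Equation (7); probabilities are clipped to avoid log(0)."""
    p = np.clip(p, clip, 1.0 - clip)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def dice_loss(p, y, eps=1e-6):
    """Dice loss, Equation (8)."""
    intersection = np.sum(p * y)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(y) + eps))


def total_loss(p, y):
    """Hybrid loss, Equation (6)."""
    return bce_loss(p, y) + dice_loss(p, y)


# One fire pixel among 10,000, and a model that predicts "no fire" everywhere.
y = np.zeros(10_000)
y[0] = 1.0
p_all_background = np.full(10_000, 0.001)
print(bce_loss(p_all_background, y), dice_loss(p_all_background, y))
```

In this example the BCE term is close to zero even though the only fire pixel is missed, whereas the Dice term stays close to one. This illustrates why the region-level term is useful under extreme class imbalance.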

The model is trained using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴, a batch size of 16, and 50 epochs. A ReduceLROnPlateau strategy is employed to adaptively adjust the learning rate when the validation F1-score does not improve for five consecutive epochs, enhancing convergence stability.
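The plateau-based learning-rate rule can be illustrated without any deep learning framework. The class below is a simplified stand-in, not PyTorch's `ReduceLROnPlateau`: the reduction factor of 0.5 is a hypothetical value, and the rate is lowered as soon as five consecutive epochs pass without an improvement in validation F1-score.

```python
class PlateauScheduler:
    """Simplified plateau rule that tracks the best validation F1-score."""

    def __init__(self, lr=1e-4, factor=0.5, patience=5):
        self.lr = lr
        self.factor = factor      # hypothetical reduction factor
        self.patience = patience  # epochs without improvement before reducing
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_f1):
        if val_f1 > self.best:
            self.best = val_f1
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler()
history = [0.50, 0.60, 0.60, 0.60, 0.60, 0.60, 0.60, 0.70]  # toy validation F1 values
lrs = [scheduler.step(f1) for f1 in history]
print(lrs)
```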

3.4. Fire Hotspot Extraction and Post-Processing

The trained model is applied to full-scene imagery to generate fire hotspot detection results through a sliding-window inference strategy.

First, image patches of 128 × 128 pixels are extracted with a stride of 64 pixels, and each patch is processed to obtain a probability map. Overlapping regions are averaged to produce a continuous probability map P(x,y).
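Overlap averaging can be implemented by accumulating a probability sum and a visit count per pixel. In this sketch `predict_patch` is a hypothetical callable standing in for the trained network, and the scene size is assumed to be tiled exactly by the windows; pixels never covered by a window would keep a probability of zero.

```python
import numpy as np


def sliding_window_predict(image, predict_patch, patch=128, stride=64):
    """Average overlapping patch predictions into one (H, W) probability map."""
    _, height, width = image.shape
    prob_sum = np.zeros((height, width))
    counts = np.zeros((height, width))
    for r in range(0, height - patch + 1, stride):
        for c in range(0, width - patch + 1, stride):
            prob = predict_patch(image[:, r:r + patch, c:c + patch])
            prob_sum[r:r + patch, c:c + patch] += prob
            counts[r:r + patch, c:c + patch] += 1
    return prob_sum / np.maximum(counts, 1)  # avoid division by zero


def toy_model(patch):
    """Hypothetical stand-in for SSA-UNet: a sigmoid of the first band."""
    return 1.0 / (1.0 + np.exp(-patch[0]))


rng = np.random.default_rng(0)
image = rng.normal(size=(5, 256, 256))
prob_map = sliding_window_predict(image, toy_model)
print(prob_map.shape)
```

Because the toy model is purely per-pixel, averaging the overlapping predictions reproduces its output exactly, which gives a convenient check on the bookkeeping.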

A threshold-based segmentation is then applied:

B(x, y) = \begin{cases} 1, & P(x, y) > T \\ 0, & P(x, y) \leq T \end{cases}

(9)

Instead of using an empirically fixed threshold, the decision threshold is determined through validation-based optimization. Specifically, a threshold sensitivity analysis is conducted, and the optimal threshold is selected by maximizing the F1-score. Subsequently, morphological opening and closing operations are performed to remove noise and fill small holes. Connected component analysis is applied to identify individual fire hotspot regions, and regions smaller than three pixels are removed as false detections. Finally, the binary mask is converted into vector polygons, from which spatial attributes including centroid, area, and mean probability are derived.
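Threshold selection and the minimum-size filter can be sketched with NumPy and `scipy.ndimage`. The candidate thresholds, the pixel-wise F1 computation, and the use of 8-connectivity are assumptions. The morphological opening and closing are left out because the structuring element is not specified.

```python
import numpy as np
from scipy import ndimage


def f1_score(pred, ref):
    """Pixel-wise F1 between two boolean masks."""
    tp = np.sum(pred & ref)
    fp = np.sum(pred & ~ref)
    fn = np.sum(~pred & ref)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(prob, ref, candidates):
    """Return the first candidate threshold that maximizes F1 on validation data."""
    scores = [f1_score(prob > t, ref) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


def remove_small_regions(binary, min_pixels=3):
    """Drop connected regions smaller than `min_pixels` (8-connectivity assumed)."""
    labels, _ = ndimage.label(binary, structure=np.ones((3, 3), dtype=int))
    sizes = np.bincount(labels.ravel())
    keep = sizes >= min_pixels
    keep[0] = False  # label 0 is the background
    return keep[labels]


# Toy validation map: a six-pixel fire and one isolated false alarm.
prob = np.full((20, 20), 0.02)
prob[5:7, 5:8] = 0.93
prob[15, 15] = 0.42
ref = np.zeros((20, 20), dtype=bool)
ref[5:7, 5:8] = True
candidates = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
best_t = select_threshold(prob, ref, candidates)
cleaned = remove_small_regions(prob > 0.1)
print(best_t, int(cleaned.sum()))
```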

3.5. Evaluation Metrics

To quantitatively evaluate the performance of fire hotspot detection, a buffer-based spatial matching strategy is adopted to address the inherent mismatch between pixel-level predictions and point-based reference data. To ensure statistical robustness, all performance metrics are computed over the full test set, and spatial–temporal uncertainties in reference fire products are explicitly considered through buffer-based matching and temporal filtering strategies.

The reference fire product provides geolocated fire detections with known spatial uncertainties, typically ranging from 375 m (VIIRS) to 1 km (MODIS). Direct pixel-wise comparison between model outputs and FIRMS points is therefore not physically meaningful. To ensure a fair and scientifically sound evaluation, each FIRMS detection is expanded into a circular buffer with a radius of 1 km, accounting for geolocation uncertainty and spatial resolution differences.

A predicted fire pixel is considered a True Positive (TP) if it falls within any FIRMS buffer zone. Predictions outside all buffer zones are counted as False Positives (FP), while FIRMS buffer zones without any corresponding predictions are treated as False Negatives (FN).
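The matching rule can be sketched as follows, assuming that predicted pixel centres and FIRMS detections have already been projected into a common metric coordinate system. The function name `buffer_match` is hypothetical, and the brute-force distance loop is chosen for clarity; a spatial index would be preferable at operational scale.

```python
import math

def buffer_match(pred_xy, firms_xy, radius_m=1000.0):
    """Count TP, FP and FN using circular buffers around reference points.

    pred_xy: (x, y) centres of predicted fire pixels, in metres.
    firms_xy: (x, y) FIRMS detections in the same projected coordinate system.
    """
    tp = fp = 0
    hit = [False] * len(firms_xy)  # whether each buffer contains a prediction
    for px, py in pred_xy:
        inside_any = False
        for k, (fx, fy) in enumerate(firms_xy):
            if math.hypot(px - fx, py - fy) <= radius_m:
                inside_any = True
                hit[k] = True
        if inside_any:
            tp += 1  # predicted pixel falls inside at least one buffer
        else:
            fp += 1  # predicted pixel lies outside all buffers
    fn = hit.count(False)  # buffers without any corresponding prediction
    return tp, fp, fn
```

Note that, following the definitions above, TP and FP are counted in predicted pixels whereas FN is counted in reference buffers.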

Precision

P = \frac{TP}{TP + FP}

(10)

Recall

R = \frac{TP}{TP + FN}

(11)

F1-score

F1 = \frac{2PR}{P + R}

(12)

Intersection over Union (IoU)

IoU = \frac{TP}{TP + FP + FN}

(13)

Precision measures the reliability of detected fire hotspots, while Recall reflects the detection completeness. The F1-score provides a balanced evaluation of both metrics, and IoU quantifies the spatial overlap between predicted and ground truth masks.

Since the FIRMS product itself is not an absolute truth and has its own detection limits and errors, the above metrics reflect the consistency of the proposed method with this specific operational reference product, rather than absolute detection accuracy.

4. Results and Analysis

All experiments were conducted on a workstation equipped with an Intel Core i7-12700H CPU and 32 GB RAM under a unified experimental framework to ensure fair and reproducible evaluation. Model training and quantitative analyses were implemented using Python 3.9, PyTorch 2.0.1, and CUDA 11.8. Numerical calculations and statistical analyses were performed using NumPy 1.24.4 and SciPy 1.11.4. Image processing and mask refinement relied on scikit-image 0.22.0. Geospatial data processing was conducted using GDAL 3.4.3 and Rasterio 1.3.11, while visualization and figure generation were carried out using Matplotlib 3.7.2 and QGIS 3.34.

4.1. Training Dynamics

The model was trained using the Adam optimizer with default momentum parameters β1 = 0.9 and β2 = 0.999, and ε = 1 × 10⁻⁸. A weight decay of 1 × 10⁻⁵ was applied to reduce overfitting.

The initial learning rate was set to 1 × 10⁻⁴. A ReduceLROnPlateau learning rate scheduler was employed to adaptively adjust the learning rate based on validation performance, with a reduction factor of 0.5, patience of 5 epochs, and a minimum learning rate of 1 × 10⁻⁶.
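To make the scheduling rule concrete, the sketch below simulates a plateau-based schedule in plain Python with the stated factor, patience, and minimum learning rate. The function name `plateau_schedule` is hypothetical, and the counting rule (reduce once the validation F1-score has failed to improve for five consecutive epochs) is one reading of the description; it is a simplified stand-in, not the library scheduler itself.

```python
def plateau_schedule(f1_history, lr=1e-4, factor=0.5, patience=5, min_lr=1e-6):
    """Return the learning rate in effect after each epoch's validation F1."""
    best = float("-inf")
    bad_epochs = 0
    rates = []
    for f1 in f1_history:
        if f1 > best:
            best = f1
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            # Halve the rate, but never go below the configured floor.
            lr = max(lr * factor, min_lr)
            bad_epochs = 0
        rates.append(lr)
    return rates
```

Starting from 1 × 10⁻⁴ with a factor of 0.5, seven reductions would be needed to reach the 1 × 10⁻⁶ floor, so within 50 epochs the floor acts mainly as a safeguard.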

The model was trained for 50 epochs with a batch size of 16. Early stopping was not applied, as convergence was consistently observed within the predefined number of epochs.

4.1.1. Loss Convergence Behavior

The training process demonstrates a stable and well-converged optimization trajectory throughout the entire training period. As shown in Figure 6, both the training loss and validation loss decrease consistently with increasing epochs, indicating effective learning of wildfire-related spectral–spatial features by the proposed SSA-UNet model.

Rapid learning stage (Epoch 1–10): The model quickly captures discriminative spectral–spatial patterns of fire hotspots, leading to a sharp decrease in both training and validation losses.

Progressive optimization stage (Epoch 10–30): The loss continues to decline at a slower rate, indicating refinement of fine-grained features such as local texture and contextual contrast.

Convergence stage (Epoch 30–50): The validation loss stabilizes around its minimum (0.2778), with a consistently small gap between training and validation curves, suggesting good generalization and absence of overfitting.

Notably, validation loss is occasionally lower than training loss, which can be attributed to implicit regularization effects such as Batch Normalization and data augmentation.

4.1.2. Metric Evolution

Performance metrics, including Precision, Recall, F1-score, and Intersection over Union (IoU), exhibit consistent improvement throughout the training process. As illustrated in Figure 7, all evaluation metrics increase steadily with increasing epochs.

During early training, Recall increases rapidly, indicating that the model prioritizes detecting potential fire hotspots. As training progresses, Precision gradually improves, reflecting enhanced suppression of false positives.

In later stages, all metrics stabilize, with F1-score reaching ~0.88 and IoU ~0.77, demonstrating a balanced trade-off between detection sensitivity and reliability.

This dynamic reflects a typical precision–recall trade-off optimization process, where the model evolves from coarse detection to refined discrimination.

4.1.3. Learning Rate Adaptation

The adoption of the ReduceLROnPlateau strategy plays a critical role in stabilizing training. As performance improvement slows in the mid-to-late stages, the learning rate is adaptively reduced, allowing finer parameter updates and preventing oscillations around local minima. This contributes to the smooth convergence observed in the final training phase.

4.2. Quantitative Performance Evaluation

4.2.1. Threshold Optimization

The prediction outputs of the SSA-UNet model are continuous probability values ranging from 0 to 1, where each pixel value represents the estimated likelihood of wildfire hotspot occurrence. Rather than adopting a fixed empirical threshold, this study systematically evaluated model performance under multiple threshold settings to determine the optimal segmentation criterion for fire hotspot extraction. As illustrated in Figure 8, threshold values ranging from 0.1 to 0.9 with an interval of 0.1 were tested to analyze the sensitivity of the model to different probability cutoffs. Lower threshold values generally produce higher Recall because more potential fire pixels are retained; however, they also increase the risk of false-positive detections caused by background interference. Conversely, higher thresholds improve Precision by suppressing weak and uncertain predictions but may lead to omission of small or low-intensity fire hotspots.

The results reveal a clear trend: as the threshold increases, Precision gradually improves due to the suppression of false positives, while Recall decreases as more low-confidence fire pixels are discarded. Consequently, the F1-score exhibits a unimodal distribution, and the optimal balance is achieved at T = 0.40, where the F1-score reaches its maximum.
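A validation-based threshold search of this kind can be sketched as below. The function name `best_f1_threshold` is hypothetical, and the sketch scores pixels against a binary reference mask rather than using the buffer-based matching of Section 3.5, which is a simplification.

```python
import numpy as np

def best_f1_threshold(prob, truth, thresholds=None):
    """Return (threshold, f1) maximising F1 over candidate thresholds."""
    if thresholds is None:
        # 0.1, 0.2, ..., 0.9, built from integers to avoid float drift.
        thresholds = [i / 10 for i in range(1, 10)]
    prob = np.asarray(prob, dtype=float).ravel()
    truth = np.asarray(truth, dtype=bool).ravel()
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        pred = prob > t
        tp = np.count_nonzero(pred & truth)
        fp = np.count_nonzero(pred & ~truth)
        fn = np.count_nonzero(~pred & truth)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest tied threshold
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```

Because the threshold is chosen on validation data only, the independent test set remains untouched by this step.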

4.2.2. Confusion Matrix Analysis

As illustrated in Figure 9, among 10,614,832 evaluated pixels, 200,435 fire pixels are correctly detected (TP), while 31,113 pixels are falsely classified as fire (FP), and 29,825 fire pixels are missed (FN).

The relatively low number of false positives demonstrates strong background suppression capability, which is essential in complex environments such as deserts and cloud-contaminated regions. Meanwhile, the remaining false negatives are primarily associated with small-scale or low-intensity fires, highlighting the inherent difficulty of detecting weak thermal signals under moderate spatial resolution.
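As a consistency check, the counts reported for Figure 9 can be substituted into Equations (10)–(13). The short sketch below (the function name `detection_metrics` is illustrative) gives an F1-score of about 0.868 and an IoU of about 0.767 for these counts.

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, F1-score and IoU from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f1, iou

# Pixel counts quoted in the text for Figure 9.
precision, recall, f1, iou = detection_metrics(200_435, 31_113, 29_825)
print(f"Precision={precision:.4f} Recall={recall:.4f} "
      f"F1={f1:.4f} IoU={iou:.4f}")
```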

4.2.3. ROC and PR Curve Analysis

As shown in Figure 10a, the model achieves an Area Under the Receiver Operating Characteristic (ROC) Curve value of 0.998, indicating excellent separability between fire and non-fire classes across different classification thresholds. However, because wildfire hotspot detection is characterized by severe class imbalance, where fire pixels occupy only a very small proportion of the total image, ROC curves alone may overestimate practical detection performance. Therefore, the Precision–Recall (PR) curve was additionally evaluated to provide a more reliable assessment under sparse-target conditions. As illustrated in Figure 10b, the PR curve achieves an AUC value of 0.946, demonstrating strong robustness in identifying sparse wildfire hotspot targets while maintaining relatively high precision.
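The effect of class imbalance can be illustrated with the pixel counts quoted in Section 4.2.2, treated here as a single operating point. In this sketch the true-negative count is derived by subtraction, which assumes the four confusion-matrix cells sum to the stated total.

```python
total, tp, fp, fn = 10_614_832, 200_435, 31_113, 29_825
tn = total - tp - fp - fn  # assumed: remaining pixels are true negatives

prevalence = (tp + fn) / total       # share of fire pixels in the data
false_positive_rate = fp / (fp + tn) # x-axis of the ROC curve
false_discovery_rate = fp / (tp + fp)  # equals 1 - Precision

print(f"fire prevalence      = {prevalence:.2%}")
print(f"false positive rate  = {false_positive_rate:.2%}")
print(f"false discovery rate = {false_discovery_rate:.2%}")
```

Because true negatives dominate the denominator, the false positive rate stays well below 1% even though roughly one in seven predicted fire pixels is a false alarm, which is why the PR curve is the more informative summary for sparse targets.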

4.3. Comparative Evaluation

To comprehensively evaluate the effectiveness of the proposed method, we conducted comparative experiments against both conventional approaches and state-of-the-art deep learning models. To ensure a fair and rigorous comparison, all baseline methods were carefully optimized under a consistent experimental framework.

For traditional methods, key parameters (e.g., SWIR threshold, window size, and sensitivity coefficient) were systematically tuned using grid search on the validation set to maximize their respective F1-scores. This ensures that the performance of conventional approaches reflects their optimal capability rather than suboptimal parameter settings.

For deep learning models, identical training datasets, preprocessing procedures, network input sizes, loss functions, and optimization strategies were adopted.

4.3.1. Experimental Setup

Two traditional methods were selected as baseline approaches:

(1): SWIR threshold-based method

Fires typically exhibit significantly high brightness in the shortwave infrared band; therefore, high-temperature pixels can be identified by setting a threshold. This method uses the B13 (1640 nm) band of the HY-1E image for fire discrimination:

T_{SWIR} \geq 11000

(14)

When a pixel’s brightness value exceeds this threshold, it is classified as a fire pixel. Similar single-band threshold strategies have been widely used in early AVHRR and MODIS fire detection algorithms [26].

(2): Spatial context-based method (SCM)

To reduce the false alarm rate of the single-band threshold method, a spatial context-based approach is adopted. It follows the general framework of contextual fire detection algorithms widely used in satellite-based fire products, particularly those developed for MODIS and AVHRR sensors [8,27].

Unlike pixel-wise thresholding, this approach incorporates local neighborhood statistics to evaluate the relative anomaly of a candidate pixel with respect to its surrounding background. For each candidate pixel i, a local window Ωi of size w × w (typically 3 × 3 or 5 × 5) is defined. The local background statistics excluding the central pixel are computed as:

\mu_{i} = \frac{1}{|\Omega_{i}| - 1} \sum_{j \in \Omega_{i}, j \neq i} L_{j}^{SWIR}, \qquad \sigma_{i} = \sqrt{\frac{1}{|\Omega_{i}| - 1} \sum_{j \in \Omega_{i}, j \neq i} (L_{j}^{SWIR} - \mu_{i})^{2}}

(15)

where μi and σi denote the local mean and standard deviation, respectively.

A pixel is classified as fire if it satisfies both a global intensity constraint and a local anomaly condition:

L_{i}^{SWIR} > T_{SWIR}, \qquad L_{i}^{SWIR} - \mu_{i} > k \times \sigma_{i}

(16)

where k is a sensitivity parameter (typically ranging from 2 to 4).

This contextual strategy ensures that detected fire pixels are not only spectrally bright but also significantly different from their local background, thereby effectively reducing commission errors caused by spatially homogeneous bright surfaces.
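The two conditions of Equation (16), with the background statistics of Equation (15), can be sketched in NumPy as follows. The function name `contextual_swir_detection` is hypothetical, the default parameter values are those quoted in this paper (threshold 11000, w = 5, k = 3), and leaving border pixels unflagged is a simplification made for this sketch.

```python
import numpy as np

def contextual_swir_detection(swir, t_swir=11000.0, w=5, k=3.0):
    """Flag pixels meeting both conditions of Equation (16).

    swir: 2-D array of SWIR brightness values (B13 in the text).
    Border pixels without a full w x w window are left unflagged.
    """
    swir = np.asarray(swir, dtype=np.float64)
    half = w // 2
    height, width = swir.shape
    fire = np.zeros((height, width), dtype=bool)
    centre = (w * w) // 2  # flat index of the central pixel in the window

    # Only pixels passing the global intensity test are candidates.
    for r, c in zip(*np.nonzero(swir > t_swir)):
        if r < half or c < half or r >= height - half or c >= width - half:
            continue
        window = swir[r - half:r + half + 1, c - half:c + half + 1].ravel()
        background = np.delete(window, centre)  # exclude pixel i itself
        mu = background.mean()
        sigma = background.std()  # divides by the |Omega_i| - 1 neighbours
        fire[r, c] = (swir[r, c] - mu) > k * sigma
    return fire
```

A uniformly bright surface passes the global test but yields a zero local anomaly, so it is rejected by the second condition; this is the commission-error suppression described above.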

In addition, two widely used deep learning segmentation models were adopted for comparison: (3) U-Net and (4) DeepLabV3+.

To further investigate the effectiveness of attention mechanisms, an attention-enhanced baseline (CBAM-UNet) was also included.

To ensure a fair comparison, all deep learning models were trained under identical conditions. Specifically, the same training and validation datasets were used, identical preprocessing procedures were applied, the input patch size was fixed at 128 × 128 pixels, the loss function was uniformly set to BCE + Dice loss, the Adam optimizer with an initial learning rate of 1 × 10⁻⁴ was used, and all models were trained for 50 epochs with a batch size of 16. This unified experimental protocol ensures that performance differences arise from model design rather than training strategies.

4.3.2. Ablation Study

To quantitatively evaluate the contribution of each attention component, an ablation study was conducted using four network configurations with different combinations of spectral attention and spatial attention modules. As summarized in Table 4, the experiments include the baseline U-Net, U-Net with spectral (channel) attention, U-Net with spatial attention, and the SSA-UNet integrating both attention mechanisms.

The quantitative results presented in Table 5 demonstrate that both attention modules contribute positively to wildfire hotspot detection performance, although their effects differ in terms of feature representation behavior. Compared with the baseline U-Net, the introduction of spectral attention improves Precision. This improvement indicates that spectral attention effectively enhances the selection of fire-sensitive spectral channels, particularly SWIR and thermal infrared bands, thereby suppressing false-positive responses caused by spectrally similar background regions. In contrast, the incorporation of spatial attention produces a relatively larger improvement in Recall. This result suggests that spatial attention enhances the network’s ability to capture contextual spatial dependencies and identify fragmented or weak wildfire hotspots under heterogeneous environmental conditions.

4.3.3. Quantitative Comparison

The quantitative comparison results on the validation dataset are presented in Table 6 and Figure 11.

The results demonstrate that traditional methods achieve relatively high Recall but suffer from low Precision due to their sensitivity to background noise and spectral ambiguity.

Parameter sensitivity analysis of traditional methods: The performance of the SWIR thresholding method and the spatial context method is highly sensitive to parameter settings, namely the threshold T_SWIR, the local window size w, and the sensitivity parameter k. To ensure a fair comparison with deep learning methods, these parameters were optimized on the validation set via grid search to maximize the respective F1-scores. The optimized spatial context method (w = 5, k = 3) raised Precision from 0.5500 to 0.6500 compared with the simple SWIR thresholding method, an absolute gain of 0.10 (approximately 18% in relative terms), confirming that the introduction of local spatial statistics effectively suppresses false positives caused by bright backgrounds. Nevertheless, the performance ceiling of these two traditional methods remains significantly lower than that of deep learning models. This highlights that, on novel sensor data such as HY-1E COCTS2, models relying on fixed thresholds and simple statistical rules struggle to capture the nonlinear, multiscale spectral–spatial patterns between fires and complex backgrounds, whereas data-driven deep learning methods possess inherent advantages in this regard.

Deep learning models significantly improve overall performance by learning complex spectral–spatial patterns. Among them, DeepLabV3+ achieves strong performance due to its ability to capture multi-scale contextual information.

The proposed SSA-UNet outperforms all comparison methods, achieving the highest F1-score and IoU. Compared with the baseline U-Net, SSA-UNet improves F1-score by 10.1% and IoU by 12.1%, indicating substantial enhancement in both detection accuracy and spatial consistency.

Furthermore, compared with the attention-based CBAM-UNet, SSA-UNet still achieves a notable improvement (+3.6% F1-score), demonstrating the superiority of the proposed spectral–spatial attention design over conventional attention mechanisms.

4.3.4. Visual Assessment

(1): Visual Comparison

Qualitative results further support the quantitative findings. Traditional methods produce scattered false detections in high-reflectance regions, U-Net tends to over-detect background noise, and DeepLabV3+ improves spatial consistency but still misses small fire regions. In contrast, SSA-UNet produces more compact and accurate fire regions with clearer boundaries.

(2): Predictions on Validation Set Samples

To intuitively analyze the model’s predictive capability, four representative samples were randomly selected from the validation set for visualization analysis, as shown in Figure 12.

For samples containing fire, the model could accurately identify fire regions. The predicted probability maps showed high consistency with the ground truth masks, with clear fire boundaries and reasonable spatial distribution. In areas with dense fires, the model effectively distinguished adjacent fire targets, maintaining good spatial resolution. For very small fires (occupying only about a dozen pixels), the model could still generate high prediction probabilities (typically greater than 0.8), indicating strong detection capability for small-target fires.

For samples without fire, prediction probabilities for most background areas were below 0.1, indicating the model’s good discriminative ability for non-fire regions and a low false-positive rate.

4.3.5. Generalization Ability

To evaluate the application potential of the proposed SSA-UNet model in unseen spatiotemporal scenarios, this study designed cross-spatiotemporal generalization experiments. As described in Section 2.2.3, the independent test set used in this study consists of three images captured in March 2026. This dataset is entirely from a later time period than the training set in order to evaluate the model’s performance when faced with future observational data. The test set is not used at any stage of the model training or hyperparameter tuning process (including decision threshold optimization).

The generalization performance of different models on the independent test set is summarized in Table 7. All compared models exhibited varying degrees of performance degradation (decrease in F1-score) when applied to the unseen 2026 test data. This phenomenon is expected and reflects domain shift issues caused by seasonal variations, differences in surface conditions, and changes in fire burning characteristics.

However, the SSA-UNet model proposed in this study demonstrated the best robustness. Its F1-score decreased from 0.8680 on the validation set to 0.8195 on the test set, with a performance drop significantly lower than that of all comparison models. Specifically, the performance drop of SSA-UNet is approximately 57.7% smaller than that of the baseline U-Net and approximately 42.0% smaller than that of CBAM-UNet, which also incorporates an attention mechanism. These results indicate that SSA-UNet not only achieves the highest performance but also learns feature representations that exhibit stronger invariance and transferability across temporal variations.

4.4. Discussion

4.4.1. Error and Limitation Analysis

To further understand the limitations of the proposed model, a detailed error analysis is conducted by examining typical false positive (FP) and false negative (FN) cases.

(1): False-Negative Analysis (Missed Detections)

False negatives mainly occur in several challenging scenarios, as illustrated in Figure 13.

First, extremely small-scale fires occupying fewer than approximately 10 pixels are more likely to be missed because their spectral and spatial signatures are weak relative to the surrounding background. During convolution and pooling operations, these subtle fire features may become diluted or suppressed.

Second, low-intensity or smoldering fires often exhibit relatively weak thermal anomalies in the SWIR and TIR bands, resulting in insufficient spectral contrast between fire pixels and surrounding surfaces. Such fires are therefore difficult for the model to distinguish from background noise.

Third, complex environmental conditions, including cloud shadows, water-adjacent regions, and heterogeneous land surfaces, may reduce radiative contrast and interfere with fire feature extraction, thereby increasing omission errors.

(2): False-Positive Analysis (Commission Errors)

False positives are primarily associated with several types of high-temperature or high-reflectance background surfaces, as shown in Figure 14.

Typical false alarm sources include industrial areas, sunlit bare soil, desert surfaces, and cloud edges. These targets may exhibit spectral characteristics partially similar to wildfire signals, particularly in the SWIR bands, leading to misclassification by the network.

In addition, thin cirrus clouds and strong surface reflections occasionally generate localized thermal anomalies that resemble weak fire signals, further increasing the probability of commission errors under complex atmospheric conditions.

(3): Limitations: Scale Effect and Small-Target Omission

To quantitatively evaluate the influence of fire size on detection performance, fire regions were grouped according to their pixel area, and the corresponding F1-scores were calculated for each size category.

As shown in Figure 15, the detection performance exhibits a clear scale-dependent characteristic. For fire targets smaller than 10 pixels, the F1-score decreases substantially, indicating limited detection capability for extremely small fires. Detection performance improves progressively for targets ranging from 10 to 50 pixels and becomes relatively stable for larger fire regions exceeding 50 pixels.

Statistical analysis further indicates that small fires (<10 pixels) account for more than 47% of all false-negative cases. This result confirms that small-scale fire omission remains one of the primary limitations of the current model. The main reason for this phenomenon is the feature dilution effect. Extremely small fire targets occupy only a very small proportion of the input image patch (approximately 0.01–0.1%), causing their spectral–spatial signatures to be weakened during repeated convolution and downsampling operations.
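The quoted proportion follows directly from the 128 × 128 patch size. A short illustrative calculation (the helper name `patch_fraction_percent` is hypothetical):

```python
PATCH_PIXELS = 128 * 128  # 16,384 pixels per input patch

def patch_fraction_percent(fire_pixels):
    """Share of one input patch occupied by a fire target, in percent."""
    return 100.0 * fire_pixels / PATCH_PIXELS

for n in (2, 5, 10):
    print(f"{n:>2} px -> {patch_fraction_percent(n):.3f}% of the patch")
```

A 10-pixel fire therefore covers only about 0.06% of a patch, which is consistent with the range given above.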

4.4.2. Case Study

To further evaluate the practical applicability of the proposed SSA-UNet framework under real wildfire conditions, a representative wildfire event that occurred on 4 April 2025 near the boundary between Qinyuan County and Pingyao County in Shanxi Province was selected as a case study.

The wildfire developed rapidly under strong wind conditions, resulting in fast fire spread across mountainous terrain and heterogeneous land-cover environments. Such conditions present considerable challenges for remote sensing-based wildfire hotspot detection because strong winds can cause fragmented fire fronts, rapid spatial expansion, and smoke interference, while mountainous terrain introduces substantial background heterogeneity and terrain-shadow effects.

The HY-1E COCTS2 image used in this case study was acquired at approximately 10:00 on 5 April 2025, nearly one day after the wildfire outbreak. At the time of satellite observation, active flame regions were difficult to visually identify in the optical imagery because of smoke coverage, reduced flame intensity, coarse spatial resolution, and complex mountainous terrain. As shown in Figure 16a, the fire-affected region is barely distinguishable from the surrounding background in the visible bands of the 500 m resolution COCTS2 optical imagery. As illustrated in Figure 16b, thermally anomalous regions remain observable despite the weak visual response in the optical observations.

To further verify the reliability of the detected wildfire hotspots, a 2 m spatial-resolution image acquired over the same region on 5 April 2025 was additionally analyzed. As shown in Figure 16c, clear burn scars and fire-affected surface features can be visually identified in the high-resolution imagery, despite being difficult to recognize in the 500 m optical observations from HY-1E COCTS2. The spatial consistency between the high-resolution burn features and the wildfire hotspots detected by SSA-UNet further demonstrates the effectiveness of the proposed framework for identifying residual wildfire activity under coarse-resolution satellite observations.

4.4.3. Practical Limitations

Although the proposed SSA-UNet achieved strong overall performance, several practical limitations remain.

Cloud contamination, smoke coverage, and atmospheric interference may obscure thermal anomalies and reduce detection reliability under complex weather conditions. In addition, the current study is based on a relatively limited number of wildfire scenes. Although extensive patch extraction and data augmentation were employed, additional multi-season and multi-regional datasets are still required to further evaluate model robustness and transferability.

Another limitation arises from the use of FIRMS products as reference labels instead of true ground observations. Since MODIS and VIIRS fire products themselves contain spatial uncertainty, temporal mismatch, and omission errors, the evaluation results should be interpreted as relative consistency with existing operational fire products rather than absolute detection accuracy.

Moreover, rapid wildfire evolution and satellite revisit limitations may introduce temporal observation gaps under extreme weather conditions. Deep learning inference over large-scale remote sensing imagery also requires considerable computational resources, which may limit operational deployment efficiency in near-real-time wildfire monitoring systems.

5. Conclusions and Future Work

5.1. Conclusions

This study develops a deep learning-based framework for fire hotspot detection using multispectral observations from HY-1E COCTS2, targeting large-scale wildfire monitoring over the Three-North region of China. The results demonstrate that HY-1E COCTS2 multispectral data, when coupled with deep learning, provide a viable solution for large-scale fire hotspot detection under complex environmental conditions. Accurate and timely wildfire hotspot detection is essential for reducing ecological degradation, protecting forest and grassland resources, mitigating carbon emissions caused by wildfires, and enhancing regional resilience in the context of global climate change, thereby contributing to sustainable development goals related to climate action and ecosystem conservation.

The main conclusions of this study are summarized as follows.

(1) Feasibility of HY-1E COCTS2 for wildfire hotspot detection

The multispectral configuration of HY-1E COCTS2, particularly the NIR, SWIR, and TIR bands, exhibits strong sensitivity to thermal anomalies associated with wildfire activity. The results confirm that COCTS2 can effectively detect moderate-to-high-intensity wildfire hotspots and provides a valuable supplementary data source for large-scale land-based wildfire monitoring applications. The capability of continuous and large-scale wildfire observation further supports sustainable management of ecological resources.

(2) Effectiveness of the proposed SSA-UNet framework

The proposed SSA-UNet framework effectively learns nonlinear spectral–spatial representations through end-to-end feature extraction and attention-based optimization, significantly outperforming conventional threshold-based approaches. The improvement is particularly evident in complex environments, where traditional methods suffer from spectral ambiguity and high false-alarm rates. By improving detection accuracy and reducing uncertainty, the proposed framework enhances the reliability of satellite-based wildfire monitoring systems and provides technical support for sustainable disaster prevention and emergency response.

(3) Practical applicability

The Shanxi Qinyuan–Pingyao wildfire case study verified the practical applicability of SSA-UNet for residual wildfire hotspot monitoring under real-world conditions. Even when wildfire signatures were difficult to visually identify in coarse-resolution optical imagery, the proposed framework could still successfully identify thermally anomalous regions associated with residual wildfire activity. This capability is beneficial for post-fire management, ecological restoration, and long-term sustainable land management, especially in ecologically fragile regions.

5.2. Future Work

Although the proposed framework achieved promising results, several challenges still require further investigation.

First, extremely small-scale and low-intensity fires remain difficult to detect because their spectral signatures are easily weakened during convolution and downsampling operations. Future studies should therefore explore more effective multi-scale feature extraction strategies, such as feature pyramid networks and adaptive attention mechanisms, to improve sensitivity to small wildfire hotspots.

Second, the current study was conducted using a relatively limited number of wildfire scenes from the Three-North region of China. Expanding high-quality annotated datasets and conducting cross-regional experiments in different ecological environments will be important for improving model robustness, transferability, and operational reliability.

Third, integrating multi-source satellite observations, including MODIS, VIIRS, and higher-resolution remote sensing imagery, may further improve temporal continuity, detection reliability, and monitoring efficiency. Future work will also investigate lightweight network architectures and near-real-time processing strategies to enhance the operational applicability of deep learning-based wildfire monitoring systems.

Author Contributions

Conceptualization, H.Z. and L.H.; methodology, Y.Z.; software, L.C.; validation, W.Z., Y.F. and Y.S.; formal analysis, H.Z.; investigation, Y.Z.; resources, Y.S.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, H.Z. and Y.S.; visualization, Y.F. and L.H.; supervision, W.Z.; project administration, L.C.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Satellite Remote Sensing Information Services (Grant No. 2025-JW34-F5001) and Integration and Application Demonstration in the Marine Field (Grant No. 0404130306).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Location map of the study area.

Figure 2. HY-1E image spatial distribution map.

Figure 3. Overall flowchart.

Figure 4. Workflow of fire label generation and sample construction.

Figure 5. Architecture of the SSA-UNet.

Figure 6. SSA-UNet fire model training loss curve.

Figure 7. Evolution of evaluation metrics during training.

Figure 8. Sensitivity analysis of the decision threshold.

Figure 9. Confusion matrix.

Figure 10. Receiver Operating Characteristic (ROC) and Precision–Recall (PR) curves.

Figure 11. Method comparison radar chart.

Figure 12. Visualization of sample predictions.

Figure 13. FN (missed detections) examples (highlighted by yellow circles): (a) Tiny fire (~5 pixels); (b) smoldering fire; (c) fire near water.

Figure 14. FP (commission errors) examples (highlighted by red circles): (a) Industrial area; (b) sunlit desert.

Figure 15. Scale effect and small-target omission analysis: (a) Detection performance vs. fire size; (b) sample distribution and false negative analysis.

Figure 16. Case study of the Shanxi Qinyuan–Pingyao “4·4” wildfire event observed on 5 April 2025: (a) HY-1E COCTS2 optical imagery (500 m spatial resolution), where wildfire signatures are difficult to visually distinguish; (b) multispectral false-color composite used in this study, showing residual thermal anomalies (highlighted by orange circles); (c) 2 m high-resolution imagery of the same region, where burn scars and active fire-affected features are clearly visible.

Table 1. Comparison of fire detection data sources.

Dataset	Key Fire Detection Bands	Center Wavelength (μm)	Bandwidth (μm)	Spatial Resolution	Temporal Resolution
MODIS	MIR/TIR	3.96/11.03	0.18/1.00	1 km	1–2 times/day
VIIRS	SWIR/TIR	3.74/11.45	0.18/0.95	375 m	1–2 times/day
COCTS2	SWIR/TIR	3.74/10.8/12.0	0.19/1.00/1.10	≤500 m	2 times/day

Table 2. Fire-detection-related bands of HY-1E COCTS2.

ID	Center Wavelength (nm)	Bandwidth (nm)	Primary Purpose
1	1245	40	Sensitive to vegetation moisture and burned surface characteristics
2	1640	80	Useful for hot surface discrimination and burn scar detection
3	3740	190	Highly sensitive to active fire thermal radiation and hotspot detection
4	10,800	1000	Used for background temperature estimation and thermal anomaly analysis
5	12,000	1100	Enhances thermal contrast and suppresses false alarms under heterogeneous surface conditions
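
As a cross-check between Table 2 (nanometres) and Table 1 (micrometres), the band list can be held in a small lookup structure. The sketch below transcribes the Table 2 values; the dictionary layout, the helper name `band_limits_um`, and the assumption of a symmetric passband around the center wavelength are illustrative and are not taken from the sensor specification.

```python
# Center wavelength and bandwidth (nm) transcribed from Table 2.
# The data layout and helper name are illustrative assumptions.
COCTS2_FIRE_BANDS = {
    1: {"center_nm": 1245, "bandwidth_nm": 40},
    2: {"center_nm": 1640, "bandwidth_nm": 80},
    3: {"center_nm": 3740, "bandwidth_nm": 190},
    4: {"center_nm": 10800, "bandwidth_nm": 1000},
    5: {"center_nm": 12000, "bandwidth_nm": 1100},
}


def band_limits_um(band_id):
    """Return (lower, upper) band edges in micrometres.

    Assumes the passband is symmetric about the center wavelength,
    which is a simplification rather than a documented property.
    """
    band = COCTS2_FIRE_BANDS[band_id]
    half_width_nm = band["bandwidth_nm"] / 2
    lower_um = (band["center_nm"] - half_width_nm) / 1000
    upper_um = (band["center_nm"] + half_width_nm) / 1000
    return lower_um, upper_um


if __name__ == "__main__":
    for band_id in sorted(COCTS2_FIRE_BANDS):
        lower, upper = band_limits_um(band_id)
        print(f"Band {band_id}: {lower:.3f} to {upper:.3f} um")
```

Under this assumption, band 3 spans roughly 3.645 to 3.835 μm, which agrees with the 3.74 μm center and 0.19 μm bandwidth listed for COCTS2 in Table 1.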

Table 3. Image dataset.

ID	Image Name	FIRMS Point Count (for Reference)
T1	H1E_OPER_OCT_L1C_20250920T043500_20250920T044000_09661_10_R.tiff	43
T2	H1E_OPER_OCT_L1C_20250407T032000_20250407T032338_07281_10_R.tiff	762
T3	H1E_OPER_OCT_L1C_20250407T050000_20250407T050500_07287_10_R.tiff	838
T4	H1E_OPER_OCT_L1C_20250408T024500_20250408T025000_07301_10_R.tiff	451
T5	H1E_OPER_OCT_L1C_20250919T032500_20250919T033000_09652_10_R.tiff	60
T6	H1E_OPER_OCT_L1C_20250511T025000_20250511T025500_07774_10_R.tiff	1042
T7	H1E_OPER_OCT_L1C_20251024T022000_20251024T022500_10153_10_R.tiff	746
V1	H1E_OPER_OCT_L1C_20250920T025500_20250920T030000_09661_10_R.tiff	46
V2	H1E_OPER_OCT_L1C_20250405T042500_20250405T043000_07253_10_R.tiff	935
E1	H1E_OPER_OCT_L1C_20260308T054500_20260308T055000_12088_10_R.tiff	115
E2	H1E_OPER_OCT_L1C_20260305T040500_20260305T041000_12045_10_R.tiff	727
E3	H1E_OPER_OCT_L1C_20260330T033000_20260330T033500_12403_10_R.tiff	1257
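
The image names in Table 3 appear to encode two timestamps, which is convenient when grouping scenes by date. The following is a minimal sketch of how such names could be parsed. The interpretation of the two timestamps as acquisition start and end, and the names `parse_scene_name` and `id_field`, are assumptions inferred from the names themselves and not from a published file-format specification; the two trailing numeric fields are kept as uninterpreted strings.

```python
import re
from datetime import datetime

# Assumed layout, inferred from Table 3 only:
# H1E_OPER_OCT_L1C_<start>_<end>_<5 digits>_<2 digits>_R.tiff
SCENE_PATTERN = re.compile(
    r"^H1E_OPER_OCT_L1C_(\d{8}T\d{6})_(\d{8}T\d{6})_(\d{5})_(\d{2})_R\.tiff$"
)
TIME_FORMAT = "%Y%m%dT%H%M%S"


def parse_scene_name(name):
    """Split a scene file name into its timestamp and numeric fields."""
    match = SCENE_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unexpected scene name: {name}")
    start_text, end_text, id_field, suffix_field = match.groups()
    start = datetime.strptime(start_text, TIME_FORMAT)
    end = datetime.strptime(end_text, TIME_FORMAT)
    return {
        "start": start,
        "end": end,
        "duration_s": (end - start).total_seconds(),
        "id_field": id_field,          # meaning not documented here
        "suffix_field": suffix_field,  # meaning not documented here
    }


if __name__ == "__main__":
    info = parse_scene_name(
        "H1E_OPER_OCT_L1C_20250407T032000_20250407T032338_07281_10_R.tiff"
    )
    print(info["start"].date(), info["duration_s"])
```

Read this way, the T, V, and E prefixes in the ID column separate scenes from 2025 (T and V) and from March 2026 (E).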

Table 4. Ablation settings.

Model	Spectral Attention	Spatial Attention
U-Net	✗	✗
+Channel Attention	✓	✗
+Spatial Attention	✗	✓
SSA-UNet	✓	✓

“✓” indicates the module is included; “✗” indicates it is not included.

Table 5. Ablation results.

Method	Precision	Recall	F1-Score	IoU
U-Net	0.6945	0.9072	0.7668	0.646
+Channel Attention	0.72	0.91	0.8032	0.67
+Spatial Attention	0.71	0.915	0.7978	0.665
SSA-UNet	0.8913	0.7961	0.8680	0.767

Table 6. Performance comparison of different fire detection methods.

Method	Precision	Recall	F1-Score	IoU
SWIR	0.5500	0.7500	0.6320	0.46
SCM	0.6500	0.7000	0.6730	0.52
U-Net	0.6945	0.9072	0.7668	0.646
DeepLabV3+	0.7815	0.8723	0.8242	0.701
CBAM-UNet	0.8217	0.8425	0.8319	0.715
SSA-UNet	0.8913	0.7961	0.8680	0.767
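
For reference, the four metrics reported in Tables 5 and 6 can be computed from pixel-level counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions, which are assumed to match those in Section 3.5; the function name and the example counts are hypothetical and are not results from this study.

```python
def segmentation_metrics(tp, fp, fn):
    """Compute Precision, Recall, F1-score, and IoU from pooled counts.

    tp, fp, fn are counts of true-positive, false-positive, and
    false-negative fire pixels. Zero denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    metrics = segmentation_metrics(tp=80, fp=20, fn=20)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

With pooled counts, F1 is the harmonic mean of Precision and Recall, and IoU = F1 / (2 − F1). When metrics are instead averaged per sample, the tabulated values need not satisfy these identities exactly.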

Table 7. Generalization performance on the independent test set.

Method	Validation F1-Score	Test F1-Score	Performance Drop
U-Net	0.7668	0.6523	↓ 0.1145
DeepLabV3+	0.8242	0.7125	↓ 0.1117
CBAM-UNet	0.8319	0.7483	↓ 0.0836
SSA-UNet	0.8680	0.8195	↓ 0.0485

“↓” indicates the decrease in F1-score from the validation set to the test set (Validation − Test).
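
The Performance Drop column follows directly from the two F1-score columns. A short sketch that reproduces it from the Table 7 values is given below; the variable and function names are illustrative.

```python
# (validation F1, test F1) pairs transcribed from Table 7.
VALIDATION_TEST_F1 = {
    "U-Net": (0.7668, 0.6523),
    "DeepLabV3+": (0.8242, 0.7125),
    "CBAM-UNet": (0.8319, 0.7483),
    "SSA-UNet": (0.8680, 0.8195),
}


def performance_drop(validation_f1, test_f1):
    """Return Validation minus Test, rounded to four decimals as in Table 7."""
    return round(validation_f1 - test_f1, 4)


drops = {
    method: performance_drop(validation_f1, test_f1)
    for method, (validation_f1, test_f1) in VALIDATION_TEST_F1.items()
}

if __name__ == "__main__":
    for method, drop in drops.items():
        print(f"{method}: {drop:.4f}")
    print("Smallest drop:", min(drops, key=drops.get))
```

The computed differences match the tabulated values, and SSA-UNet shows the smallest decrease between the validation and test sets.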
