1. Introduction
Wildfires are among the most destructive natural hazards, causing severe ecological degradation, atmospheric pollution, biodiversity loss, and substantial socio-economic damage [
1,
2]. In recent decades, global wildfire activity has exhibited a clear trend toward increasing frequency, intensity, and spatial extent under the combined influence of climate change and extreme weather events [
3]. This escalating threat poses a direct challenge to achieving long-term environmental sustainability and ecological security goals worldwide. Accurate and timely fire hotspot detection has therefore become more than a technical issue; it is a critical requirement for supporting sustainable development through ecosystem conservation, climate adaptation, disaster mitigation, and community resilience enhancement.
Previous studies have demonstrated that prolonged droughts, persistent heatwaves, low relative humidity, and strong wind conditions significantly increase vegetation flammability and accelerate fire spread [
4,
5]. Bowman et al. [
2] emphasized that climate-driven changes in temperature and precipitation regimes are fundamentally reshaping global fire dynamics, while Abatzoglou and Williams [
6] reported that anthropogenic warming has substantially intensified wildfire occurrence by increasing fuel aridity. Particularly in arid and semi-arid regions, elevated surface temperatures and reduced soil moisture further enhance ignition probability and fire propagation [
7]. Effectively monitoring and managing wildfire risks in these vulnerable regions is integral to sustainable land management and climate adaptation strategies. Furthermore, a meso-scale risk analysis in Lovcen National Park, Montenegro, demonstrated the utility of medium-resolution satellite data in quantifying wildfire vulnerability by integrating climatic, vegetative, and topographic factors [
8]. These studies collectively highlight the importance of accurate and timely fire hotspot detection for wildfire early warning, emergency response, and post-fire assessment within sustainable disaster risk reduction frameworks.
Satellite remote sensing has become the primary approach for large-scale and continuous wildfire monitoring because of its wide spatial coverage and frequent revisit capability [
9]. From a physical perspective, fire hotspot detection is fundamentally based on the enhanced radiative emission of high-temperature targets, especially in the shortwave infrared (SWIR) and thermal infrared (TIR) spectral regions. The classical dual-band sub-pixel fire model proposed by Dozier [
10] established the theoretical foundation for estimating fire temperature and sub-pixel burning area from satellite observations. Building upon this framework, contextual thermal anomaly algorithms were subsequently developed and widely implemented in operational fire products derived from MODIS and VIIRS sensors [
11,
12,
13,
14,
15]. Among them, the NASA Fire Information for Resource Management System (FIRMS) provides globally consistent active fire products with long-term operational stability and has become one of the most widely used fire monitoring databases worldwide [
16]. The continuous advancement of satellite-based fire monitoring technologies plays an increasingly important role in supporting sustainable forest management, ecological protection, and climate governance.
Although MODIS- and VIIRS-based fire products have demonstrated considerable success, conventional threshold and contextual algorithms still face several limitations. Under moderate spatial resolution, fire pixels are often mixed with surrounding background components, leading to reduced separability between fire signals and non-fire surfaces [
17]. This issue becomes particularly severe in heterogeneous environments such as deserts, bare soil regions, urban surfaces, and cloud-contaminated areas, where spectrally similar high-temperature backgrounds can substantially increase false alarms [
18]. Consequently, wildfire detection is not merely a thermal anomaly extraction problem but also a complex spectral–spatial discrimination task under varying environmental conditions.
To improve fire detection capability, recent studies have increasingly explored multispectral remote sensing approaches by jointly utilizing visible, near-infrared (NIR), shortwave infrared (SWIR), and thermal infrared (TIR) observations [
19,
20]. Multispectral observations can simultaneously characterize vegetation structure, surface moisture conditions, and thermal anomalies, thereby providing more comprehensive information for fire-background discrimination [
21]. In parallel, machine learning techniques such as Random Forest (RF) and Support Vector Machine (SVM) have been introduced to construct multidimensional feature spaces for fire detection [
22]. More recently, deep learning methods have demonstrated superior capability in modeling complex nonlinear spectral–spatial relationships within remote sensing imagery [
18,
23]. Convolutional neural networks (CNNs), fully convolutional networks (FCNs), and transformer-based architectures have shown promising performance in wildfire mapping and burned-area segmentation tasks [
24,
25,
26]. Encoder–decoder frameworks such as U-Net [
27] are particularly effective for pixel-level prediction because of their ability to integrate multi-scale semantic and spatial features. These advances provide a pathway toward more intelligent and resilient wildfire monitoring capabilities that can support sustainability goals through data-driven approaches.
To further enhance feature representation, attention mechanisms have been widely incorporated into deep learning models. Channel attention and spatial attention modules enable adaptive emphasis on informative spectral bands and spatial regions, thereby improving the discrimination of target features under complex backgrounds [
28,
29,
30]. The Convolutional Block Attention Module (CBAM) [
28] and dual attention mechanisms [
29] have achieved strong performance in remote sensing scene analysis and semantic segmentation tasks.
The Chinese HY-1E satellite has provided a new opportunity for multispectral wildfire monitoring. HY-1E carries the second Chinese Ocean Color and Temperature Scanner (COCTS2), which provides multispectral observations spanning visible, NIR, SWIR, and thermal infrared wavelengths [
31]. A comparison of the principal characteristics of commonly used fire-monitoring datasets is presented in
Table 1. Although originally designed for marine observation, COCTS2 offers a spatial resolution suitable for terrestrial thermal anomaly detection. Compared with traditional ocean color sensors, COCTS2 exhibits enhanced radiometric sensitivity and improved thermal infrared capability, making it potentially valuable for wildfire hotspot monitoring. Preliminary studies have already demonstrated the feasibility of HY-1E observations for detecting large wildfire events [
32]. Exploring and validating new data sources like HY-1E COCTS2 for terrestrial applications aligns with the principle of promoting innovation and infrastructure for sustainability, expanding the available resources for global environmental stewardship.
Despite recent advances in multispectral fire detection, several important challenges remain unresolved. First, most traditional threshold-based methods rely on handcrafted rules and fixed empirical parameters, limiting their robustness under heterogeneous environmental conditions. Second, the application potential of ocean color satellites for terrestrial wildfire monitoring remains largely unexplored, particularly for HY-1E COCTS2.
To address these limitations, this study proposes a deep learning-based fire hotspot detection framework using HY-1E COCTS2 multispectral observations over the Three-North region of China. Specifically, a Spectral–Spatial Attention U-Net (SSA-UNet) is developed to jointly model spectral dependencies and spatial contextual information for robust fire hotspot extraction under complex background conditions. The proposed framework aims to improve the reliability and operational applicability of wildfire monitoring in ecologically sensitive regions. Enhanced wildfire detection capability can support the protection of critical ecosystems such as the Three-North Shelterbelt Project, which plays a vital role in combating desertification, preserving ecological stability, and promoting sustainable development in northern China.
3. Method
This study develops a Spectral–Spatial Attention U-Net (SSA-UNet) by fully leveraging the multispectral observation capability of HY-1E COCTS2 data and integrating adaptive spectral weighting and spatial context modeling into the classic U-Net architecture. As shown in
Figure 3, the proposed framework consists of four major stages: data preprocessing, sample construction with label generation, model training based on the attention-enhanced architecture, and fire hotspot extraction combined with post-processing.
First, the original L1C multispectral imagery from COCTS2 is preprocessed, including band selection, radiometric normalization, and spatial reference unification, to ensure spectral consistency and spatial alignment across multi-temporal images. Next, using FIRMS fire point data spatially and temporally matched to HY-1E imagery, binary label masks are generated for model training, and a sliding-window strategy is adopted to construct training samples. Subsequently, an SSA-UNet model is trained to perform pixel-level fire detection by jointly learning spectral and spatial features. Finally, the trained model is applied to full-scene imagery using a sliding-window inference approach, followed by post-processing to extract fire hotspot regions and their spatial attributes.
3.1. Data Preprocessing
To ensure the consistency and usability of multispectral remote sensing data, systematic preprocessing is applied to the Level-1C imagery from the HY-1E satellite COCTS2.
- (1)
Band selection
Based on the spectral configuration of COCTS2 and the radiative characteristics of high-temperature targets, five bands are selected to construct the multispectral input dataset. These bands cover the near-infrared (NIR) and shortwave infrared (SWIR) regions, which are sensitive to high-temperature radiation and surface moisture variation.
- (2)
Radiometric normalization
To reduce radiometric discrepancies among different HY-1E COCTS2 scenes and improve the numerical stability of deep learning training, all spectral bands were normalized using standard score normalization (z-score normalization):
$$X' = \frac{X - \mu}{\sigma + \epsilon}$$
where X denotes the original radiance, X′ is the normalized value, μ and σ represent the mean and standard deviation of the image, respectively, and ϵ = 1 × 10⁻⁶ is a small constant to avoid division by zero.
The z-score normalization method was selected instead of other commonly used approaches such as min–max scaling because the multispectral bands of HY-1E COCTS2 exhibit substantially different radiometric ranges and statistical distributions. Standardization to zero mean and unit variance reduces inter-band scale discrepancies and prevents high-value thermal bands from dominating the optimization process during network training.
In addition, wildfire hotspot detection typically involves strong thermal anomalies and highly heterogeneous background conditions, including deserts, bare soil, smoke, and cloud-contaminated regions. Compared with min–max normalization, z-score normalization is less sensitive to extreme radiance values caused by saturated fire pixels or cloud edges, thereby preserving the statistical structure of the majority of background pixels.
The standardized inputs also improve convergence stability and facilitate more effective feature learning within the proposed spectral–spatial attention framework of SSA-UNet.
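The normalization step can be illustrated with a short NumPy sketch. It assumes that the statistics are computed separately for each band of a single scene stored as a (bands, height, width) array; the function name `zscore_per_band` and the synthetic scene are illustrative and not taken from the original processing chain.

```python
import numpy as np

def zscore_per_band(image: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize each band of a (bands, height, width) array.

    Assumption: mean and standard deviation are taken per band over the scene.
    """
    mean = image.mean(axis=(1, 2), keepdims=True)
    std = image.std(axis=(1, 2), keepdims=True)
    return (image - mean) / (std + eps)

# Synthetic 5-band scene whose bands have very different radiometric ranges.
rng = np.random.default_rng(0)
loc = np.array([10.0, 50.0, 200.0, 5.0, 300.0]).reshape(5, 1, 1)
scale = np.array([1.0, 5.0, 40.0, 0.5, 60.0]).reshape(5, 1, 1)
scene = rng.normal(loc=loc, scale=scale, size=(5, 64, 64))

normalized = zscore_per_band(scene)
band_means = normalized.mean(axis=(1, 2))
band_stds = normalized.std(axis=(1, 2))
```

After standardization every band has approximately zero mean and unit variance, so no single band dominates the gradient magnitude at the start of training.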
- (3)
Spatial reference unification
To ensure spatial consistency across multi-temporal imagery, all datasets are reprojected into the WGS84 UTM coordinate system.
3.2. Label Generation and Sample Construction
The NASA FIRMS fire point product is used as the reference for generating training labels and conducting evaluations. For each HY-1E COCTS2 image, all MODIS and VIIRS fire detections within a ±6 h temporal window relative to the satellite overpass are selected to ensure temporal consistency while accounting for potential fire dynamics.
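The temporal matching rule can be sketched as a simple filter. The record layout (a dictionary with a `time` field) and the function name `within_window` are hypothetical; actual FIRMS records carry acquisition date and time in separate attributes that would first need to be parsed.

```python
from datetime import datetime, timedelta

def within_window(overpass, detections, hours=6):
    """Keep detections acquired within +/- `hours` of the satellite overpass."""
    window = timedelta(hours=hours)
    return [d for d in detections if abs(d["time"] - overpass) <= window]

# Illustrative overpass time and fire detections (not real data).
overpass = datetime(2025, 4, 1, 6, 0)
detections = [
    {"sensor": "VIIRS", "time": datetime(2025, 4, 1, 3, 30)},   # 2.5 h before
    {"sensor": "MODIS", "time": datetime(2025, 4, 1, 12, 0)},   # exactly 6 h after
    {"sensor": "MODIS", "time": datetime(2025, 4, 1, 13, 0)},   # 7 h after
    {"sensor": "VIIRS", "time": datetime(2025, 3, 31, 23, 59)}, # 6 h 1 min before
]
matched = within_window(overpass, detections)
```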
Since FIRMS provides point-based detections rather than pixel-level fire perimeters, it cannot be directly used as segmentation ground truth. Therefore, a semi-manual annotation strategy is adopted to generate training labels. Specifically, FIRMS fire points are first used as spatial anchors to locate candidate fire regions. Then, fire boundaries are delineated through visual interpretation of multispectral imagery, guided by the following strict criteria:
- (1)
Significant radiance enhancement in SWIR and TIR bands;
- (2)
Spatial continuity and morphological consistency of high-temperature regions;
- (3)
Temporal agreement with FIRMS detections;
- (4)
Exclusion of ambiguous regions (e.g., cloud edges, bright bare soil, or industrial heat sources).
This strategy ensures that the generated fire polygons represent physically meaningful fire-affected areas while minimizing subjectivity and labeling noise. The resulting vector polygons are rasterized to produce binary masks for supervised learning. The workflow of training label generation and sample construction is summarized in
Figure 4, which demonstrates how FIRMS fire points are used as spatial–temporal anchors and further refined through multispectral visual interpretation and strict boundary delineation criteria.
To construct the training dataset, a sliding-window sampling strategy is applied. Multispectral image patches of size 128 × 128 pixels are extracted with a stride of 64 pixels, resulting in 50% overlap between adjacent patches. This configuration preserves spatial context and increases sample diversity. Each image patch is paired with its corresponding binary mask.
Due to the extreme class imbalance inherent in fire detection tasks, where fire pixels typically account for less than 0.01% of the total area, negative samples are randomly down-sampled to maintain a controlled positive-to-negative ratio of approximately 1:2. In total, 3240 samples are generated, including 1080 fire-containing patches and 2160 non-fire patches.
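A minimal sketch of the patch extraction and negative down-sampling described above is given below. It assumes that a patch counts as positive when its mask contains at least one fire pixel and that windows which do not fit entirely inside the scene are skipped; both rules, as well as the function names, are assumptions made for illustration.

```python
import numpy as np

def extract_patches(image, mask, size=128, stride=64):
    """Cut a (bands, H, W) image and its (H, W) mask into overlapping patches."""
    _, height, width = image.shape
    patches = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patches.append((image[:, top:top + size, left:left + size],
                            mask[top:top + size, left:left + size]))
    return patches

def balance(patches, neg_per_pos=2, seed=0):
    """Keep all fire patches and randomly keep `neg_per_pos` non-fire patches per fire patch."""
    rng = np.random.default_rng(seed)
    positives = [p for p in patches if p[1].any()]
    negatives = [p for p in patches if not p[1].any()]
    keep = min(len(negatives), neg_per_pos * len(positives))
    chosen = rng.permutation(len(negatives))[:keep]
    return positives + [negatives[i] for i in chosen]

# Synthetic 512 x 512 scene with one small fire cluster.
image = np.zeros((5, 512, 512), dtype=np.float32)
mask = np.zeros((512, 512), dtype=bool)
mask[200:203, 200:203] = True

patches = extract_patches(image, mask)
balanced = balance(patches)
n_fire = sum(1 for _, m in balanced if m.any())
```

With a stride of half the patch size, an interior fire cluster appears in up to four overlapping patches, which is one way the overlap increases the number of fire-containing samples.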
The dataset consists of 12 HY-1E COCTS2 images. Among them, 9 images acquired in 2025 are used for training and validation (split ratio 8:2), while 3 images acquired in March 2026 are reserved as a completely independent test set. The test dataset is strictly separated in both temporal and spatial domains, ensuring no overlap with training data and providing a robust assessment of model generalization capability. Detailed information on the selected HY-1E COCTS2 images and corresponding FIRMS fire point counts is summarized in
Table 3.
To improve model generalization and robustness under limited training data, data augmentation techniques were applied during training. Specifically, each training patch was randomly augmented using the following transformations:
- (1)
Random horizontal and vertical flipping;
- (2)
Random rotation (0°, 90°, 180°, 270°);
- (3)
Random brightness and contrast adjustment (±10%);
- (4)
Gaussian noise perturbation with low variance.
These augmentations were applied on-the-fly during training to increase sample diversity and reduce overfitting, particularly under severe class imbalance conditions.
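The four augmentations can be sketched as follows. The flip probability of 0.5, the linear gain and offset used for contrast and brightness, and the noise standard deviation of 0.01 are assumptions, since the text specifies only the ±10% range and "low variance"; geometric transforms are applied identically to the patch and its mask, whereas radiometric perturbations touch the patch only.

```python
import numpy as np

def augment(patch, mask, rng):
    """Randomly augment a (bands, H, W) patch and its (H, W) mask."""
    if rng.random() < 0.5:                      # horizontal flip
        patch, mask = patch[:, :, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                      # vertical flip
        patch, mask = patch[:, ::-1, :], mask[::-1, :]
    k = int(rng.integers(0, 4))                 # rotation by k * 90 degrees
    patch = np.rot90(patch, k, axes=(1, 2))
    mask = np.rot90(mask, k, axes=(0, 1))
    gain = rng.uniform(0.9, 1.1)                # contrast change within +/-10%
    bias = rng.uniform(-0.1, 0.1)               # brightness shift (assumed units: normalized values)
    patch = patch * gain + bias
    patch = patch + rng.normal(0.0, 0.01, size=patch.shape)  # low-variance Gaussian noise
    return np.ascontiguousarray(patch), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
patch = rng.normal(size=(5, 128, 128))
mask = np.zeros((128, 128), dtype=np.uint8)
mask[10:13, 20:24] = 1
aug_patch, aug_mask = augment(patch, mask, rng)
```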
3.3. Fire Hotspot Detection Model: SSA-UNet
To improve feature representation for fire detection, a spectral–spatial attention-enhanced U-Net (SSA-UNet) is proposed.
3.3.1. Network Architecture
The network takes multispectral image patches (5 bands, 128 × 128 pixels) as input and outputs a probability map with the same spatial resolution, where each pixel represents the likelihood of a fire hotspot. As shown in
Figure 5, the proposed network adopts a typical encoder–decoder architecture composed of three major components: an encoder, a bottleneck layer, and a decoder.
The encoder (downsampling path) contains four convolutional blocks. Each block consists of two 3 × 3 convolutional layers, followed by Batch Normalization and ReLU activation, and a 2 × 2 max-pooling layer. The number of feature channels increases progressively from 32 to 256, while the spatial resolution is reduced by a factor of 2 at each stage. This structure enables hierarchical extraction of multi-scale semantic features.
The bottleneck layer contains two 3 × 3 convolutional layers with 512 channels, capturing high-level semantic representations of fire hotspots.
The decoder (upsampling path) is symmetric to the encoder and consists of four upsampling blocks. Each block includes a 2 × 2 transposed convolution for upsampling, followed by concatenation with the corresponding encoder features via skip connections, and two 3 × 3 convolutional layers with Batch Normalization and ReLU activation. The number of channels is gradually reduced from 256 to 32, enabling precise spatial reconstruction.
Output layer. A 1 × 1 convolution is applied to map the feature representation to a single-channel probability map, followed by a Sigmoid activation function:
$$P(x, y) = \frac{1}{1 + e^{-z(x, y)}}$$
where z(x,y) is the output of the 1 × 1 convolution and P denotes the predicted probability of a fire hotspot at pixel (x,y).
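As a consistency check on the architecture description, the channel count and spatial size at each stage can be traced with plain arithmetic. The sketch assumes that the 3 × 3 convolutions use padding that preserves spatial size, which the text does not state explicitly; the function name `trace_shapes` is illustrative.

```python
def trace_shapes(size=128, in_bands=5, base=32, depth=4):
    """Return (stage, channels, spatial size) for the encoder-decoder described above."""
    stages = [("input", in_bands, size)]
    channels = base
    for level in range(1, depth + 1):
        stages.append((f"encoder{level}", channels, size))  # two 3x3 convs, size preserved (assumed)
        size //= 2                                           # 2x2 max pooling halves the resolution
        channels *= 2
    stages.append(("bottleneck", channels, size))
    for level in range(depth, 0, -1):
        channels //= 2
        size *= 2                                            # 2x2 transposed convolution doubles it
        stages.append((f"decoder{level}", channels, size))
    stages.append(("output", 1, size))                       # 1x1 convolution + Sigmoid
    return stages

stages = trace_shapes()
for name, channels, size in stages:
    print(f"{name:<11} channels={channels:<4} size={size}x{size}")
```

Under these assumptions the bottleneck operates on 8 × 8 feature maps with 512 channels, and the output probability map returns to 128 × 128 pixels.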
3.3.2. Spectral–Spatial Attention Mechanism
To enhance the discriminative capability of the model, a spectral–spatial attention (SSA) module is introduced.
- (1)
Spectral Attention (Channel Attention)
In the spectral attention module, channel-wise attention is implemented using a lightweight multilayer perceptron (MLP). Specifically, both global average pooling and global max pooling are applied to extract complementary channel-wise descriptors. The average pooling operation captures the global distribution of feature responses, while max pooling emphasizes the most salient activations, enabling the aggregation of diverse statistical characteristics.
These descriptors are then passed through a shared two-layer fully connected network with a reduction ratio of r = 8, where the hidden layer dimension is C/r and C denotes the number of input channels. A ReLU activation function is applied between the two layers, followed by a sigmoid function to generate normalized channel attention weights:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$
where F is the input feature map, σ is the sigmoid function, and Mc denotes channel attention weights.
- (2)
Spatial Attention
Spatial attention focuses on the spatial distribution of fire regions. The input feature maps are first aggregated using channel-wise average pooling and max pooling, and then passed through a convolution layer to produce spatial attention maps:
$$M_s(F) = f^{7 \times 7}\big([\mathrm{AvgPool}(F);\ \mathrm{MaxPool}(F)]\big)$$
where the operation f(7×7) is implemented as a convolutional layer with a kernel size of 7 × 7, followed by Batch Normalization and a Sigmoid activation function, and Ms highlights fire-related spatial regions.
- (3)
SSA Fusion
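The final SSA module combines both mechanisms, reweighting the feature maps with the channel attention weights Mc and the spatial attention weights Ms.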
This design enables the model to simultaneously emphasize fire-sensitive spectral bands and capture spatial context of fire distribution.
- (4)
Integration into U-Net
The SSA modules are embedded into convolutional blocks in both encoder and decoder stages, allowing multi-scale attention learning.
Additionally, an input-level channel attention module is applied to refine raw spectral features.
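A minimal PyTorch sketch of such a module is given below. Several details are assumptions rather than the authors' implementation: the two pooled descriptors are summed before the sigmoid, the pooled maps in the spatial branch are concatenated, the two attentions are applied sequentially (spectral first, then spatial, as in CBAM), the shared fully connected layers are written as 1 × 1 convolutions, and the hidden width is clamped to at least one unit so the module also works on the 5-band input. The class names are illustrative.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Channel attention from average- and max-pooled descriptors with a shared MLP."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 1)   # assumed guard for small channel counts
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)           # shape (N, C, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention from channel-wise average and max maps with a 7x7 convolution."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.bn(self.conv(torch.cat([avg, mx], dim=1))))  # (N, 1, H, W)

class SSA(nn.Module):
    """Sequential spectral-then-spatial reweighting (assumed fusion order)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.spectral = SpectralAttention(channels, reduction)
        self.spatial = SpatialAttention()

    def forward(self, x):
        x = x * self.spectral(x)
        return x * self.spatial(x)

features = torch.randn(2, 32, 16, 16)
refined = SSA(32)(features)
channel_weights = SpectralAttention(32)(features)
```

Because both attention maps are broadcast over the feature tensor, the module preserves the input shape and can be placed after any convolutional block.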
3.3.3. Loss Function and Optimization
Due to the extremely low proportion of fire hotspot pixels (approximately 0.01%), the task suffers from severe class imbalance. To address this issue, a hybrid loss function combining Binary Cross-Entropy (BCE) loss and Dice loss is adopted:
$$L = L_{\mathrm{BCE}} + L_{\mathrm{Dice}}$$
The BCE loss is defined as:
$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\big]$$
The Dice loss is defined as:
$$L_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} y_i p_i + \epsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \epsilon}$$
where yi and pi denote the ground truth and predicted probability of pixel i, respectively, N is the number of pixels, and ϵ is a small constant to avoid numerical instability. This combined loss function simultaneously optimizes pixel-wise classification accuracy and region-level overlap, effectively mitigating class imbalance.
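The behaviour of the hybrid loss can be examined with a small NumPy sketch. It assumes an unweighted sum of the two terms and clips probabilities before taking logarithms; a training implementation would use the equivalent differentiable PyTorch operations instead.

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to keep the logarithm finite."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def dice_loss(y, p, eps=1e-6):
    """One minus the soft Dice coefficient between the mask and the probabilities."""
    intersection = np.sum(y * p)
    return float(1.0 - (2.0 * intersection + eps) / (np.sum(y) + np.sum(p) + eps))

def hybrid_loss(y, p):
    """Unweighted sum of BCE and Dice terms (equal weighting is an assumption)."""
    return bce_loss(y, p) + dice_loss(y, p)

# Sparse target: 4 fire pixels in a 64 x 64 patch.
truth = np.zeros((64, 64))
truth[30:32, 30:32] = 1.0
all_background = np.zeros_like(truth)   # predicts "no fire" everywhere

loss_perfect = hybrid_loss(truth, truth)
loss_missed = hybrid_loss(truth, all_background)
bce_missed = bce_loss(truth, all_background)
dice_missed = dice_loss(truth, all_background)
```

In this example a prediction that misses every fire pixel incurs only a small BCE penalty, because the four fire pixels are averaged over 4096 pixels, whereas the Dice term is close to 1; this is the motivation for combining the two.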
The model is trained using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴, a batch size of 16, and 50 epochs. A ReduceLROnPlateau strategy is employed to adaptively adjust the learning rate when the validation F1-score does not improve for five consecutive epochs, enhancing convergence stability.
3.4. Fire Hotspot Extraction and Post-Processing
The trained model is applied to full-scene imagery to generate fire hotspot detection results through a sliding-window inference strategy.
First, image patches of 128 × 128 pixels are extracted with a stride of 64 pixels, and each patch is processed to obtain a probability map. Overlapping regions are averaged to produce a continuous probability map P(x,y).
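The overlap-averaging step can be sketched as follows. The callable `predict_patch` stands in for the trained network and is hypothetical; the sketch also assumes that the scene dimensions allow the windows to cover every pixel, so pixels not reached by any window simply keep a probability of zero.

```python
import numpy as np

def sliding_window_predict(image, predict_patch, size=128, stride=64):
    """Average overlapping patch predictions into one (H, W) probability map."""
    _, height, width = image.shape
    prob_sum = np.zeros((height, width))
    count = np.zeros((height, width))
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            patch = image[:, top:top + size, left:left + size]
            prob_sum[top:top + size, left:left + size] += predict_patch(patch)
            count[top:top + size, left:left + size] += 1
    return prob_sum / np.maximum(count, 1)   # uncovered pixels stay at 0

def toy_model(patch):
    """Placeholder for the network: a per-pixel sigmoid of the first band."""
    return 1.0 / (1.0 + np.exp(-patch[0]))

rng = np.random.default_rng(0)
scene = rng.normal(size=(5, 256, 256))
prob_map = sliding_window_predict(scene, toy_model)
```

Averaging reduces seams at patch borders, since each interior pixel is predicted up to four times with different spatial context.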
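A threshold-based segmentation is then applied:
$$\hat{Y}(x, y) = \begin{cases} 1, & P(x, y) \ge T \\ 0, & \text{otherwise} \end{cases}$$
where Ŷ(x,y) is the binary fire mask and T is the decision threshold.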
Instead of using an empirically fixed threshold, the decision threshold is determined through validation-based optimization. Specifically, a threshold sensitivity analysis is conducted, and the optimal threshold is selected by maximizing the F1-score. Subsequently, morphological opening and closing operations are performed to remove noise and fill small holes. Connected component analysis is applied to identify individual fire hotspot regions, and regions smaller than three pixels are removed as false detections. Finally, the binary mask is converted into vector polygons, from which spatial attributes including centroid, area, and mean probability are derived.
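The post-processing chain can be sketched with `scipy.ndimage`. The structuring element is not specified in the text, so SciPy's default cross-shaped element and 4-connectivity are assumed here; the final conversion to vector polygons is omitted, and region attributes are reported in pixel units.

```python
import numpy as np
from scipy import ndimage

def extract_hotspots(prob, threshold=0.40, min_pixels=3):
    """Threshold, clean up, and label a probability map; return per-region attributes."""
    binary = prob >= threshold
    binary = ndimage.binary_opening(binary)   # removes isolated noise pixels
    binary = ndimage.binary_closing(binary)   # fills small holes
    labels, n_regions = ndimage.label(binary)
    regions = []
    for region_id in range(1, n_regions + 1):
        region = labels == region_id
        area = int(region.sum())
        if area < min_pixels:                  # discard regions smaller than three pixels
            continue
        row, col = ndimage.center_of_mass(region)
        regions.append({
            "centroid": (float(row), float(col)),
            "area_px": area,
            "mean_prob": float(prob[region].mean()),
        })
    return regions

# Synthetic probability map: one 5 x 5 hotspot and one isolated false alarm.
prob = np.zeros((32, 32))
prob[10:15, 10:15] = 0.9
prob[25, 25] = 0.95
hotspots = extract_hotspots(prob)
```

Note that opening with a 3 × 3 cross already removes very small clusters, so the choice of structuring element interacts with the three-pixel area filter and would need to be tuned for small fires.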
3.5. Evaluation Metrics
To quantitatively evaluate the performance of fire hotspot detection, a buffer-based spatial matching strategy is adopted to address the inherent mismatch between pixel-level predictions and point-based reference data. To ensure statistical robustness, all performance metrics are computed over the full test set, and spatial–temporal uncertainties in reference fire products are explicitly considered through buffer-based matching and temporal filtering strategies.
The reference fire product provides geolocated fire detections whose positional uncertainty is on the order of the sensor's nominal pixel size, ranging from 375 m (VIIRS) to 1 km (MODIS). Direct pixel-wise comparison between model outputs and FIRMS points is therefore not physically meaningful. To ensure a fair and scientifically sound evaluation, each FIRMS detection is expanded into a circular buffer with a radius of 1 km, accounting for geolocation uncertainty and spatial resolution differences.
A predicted fire pixel is considered a True Positive (TP) if it falls within any FIRMS buffer zone. Predictions outside all buffer zones are counted as False Positives (FP), while FIRMS buffer zones without any corresponding predictions are treated as False Negatives (FN).
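From these counts, Precision, Recall, F1-score, and Intersection over Union (IoU) are computed as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN}$$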
Precision measures the reliability of detected fire hotspots, while Recall reflects the detection completeness. The F1-score provides a balanced evaluation of both metrics, and IoU quantifies the spatial overlap between predicted and ground truth masks.
Since the FIRMS product itself is not an absolute truth and has its own detection limits and errors, the above metrics reflect the consistency of the proposed method with this specific operational reference product, rather than absolute detection accuracy.
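The buffer-based matching rule can be sketched as follows, working in pixel coordinates. The buffer radius in pixels (1 km divided by the pixel size) is passed as an argument because it depends on the image resolution; the function name and the (row, column) representation of FIRMS points are assumptions. Following the definitions above, TP and FP are counted over predicted pixels, while FN is counted over FIRMS buffers.

```python
import numpy as np

def buffer_match(pred_mask, firms_rc, radius_px):
    """Count TP/FP predicted pixels and FN reference buffers.

    pred_mask : (H, W) boolean array of predicted fire pixels
    firms_rc  : sequence of (row, col) positions of FIRMS detections
    radius_px : buffer radius expressed in pixels
    """
    rows, cols = np.nonzero(pred_mask)
    pred = np.stack([rows, cols], axis=1).astype(float)         # (P, 2)
    points = np.asarray(firms_rc, dtype=float).reshape(-1, 2)   # (F, 2)
    # Pairwise distances between predicted pixels and FIRMS points, shape (P, F).
    dist = np.linalg.norm(pred[:, None, :] - points[None, :, :], axis=2)
    inside = dist <= radius_px
    tp = int(inside.any(axis=1).sum())        # predicted pixels inside any buffer
    fp = len(pred) - tp                       # predicted pixels outside all buffers
    fn = int((~inside.any(axis=0)).sum())     # buffers containing no predicted pixel
    return tp, fp, fn

pred_mask = np.zeros((20, 20), dtype=bool)
pred_mask[5, 5] = pred_mask[5, 6] = pred_mask[15, 15] = True
firms_points = [(5, 5), (10, 0)]
tp, fp, fn = buffer_match(pred_mask, firms_points, radius_px=2)
```

The pairwise distance matrix grows with the product of predicted pixels and reference points, so a spatial index would be preferable for full scenes.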
4. Results and Analysis
All experiments were conducted on a workstation equipped with an Intel Core i7-12700H CPU and 32 GB RAM under a unified experimental framework to ensure fair and reproducible evaluation. Model training and quantitative analyses were implemented using Python 3.9, PyTorch 2.0.1, and CUDA 11.8. Numerical calculations and statistical analyses were performed using NumPy 1.24.4 and SciPy 1.11.4. Image processing and mask refinement relied on scikit-image 0.22.0. Geospatial data processing was conducted using GDAL 3.4.3 and Rasterio 1.3.11, while visualization and figure generation were carried out using Matplotlib 3.7.2 and QGIS 3.34.
4.1. Training Dynamics
The model was trained using the Adam optimizer with default momentum parameters β1 = 0.9 and β2 = 0.999, and ε = 1 × 10⁻⁸. A weight decay of 1 × 10⁻⁵ was applied to reduce overfitting.
The initial learning rate was set to 1 × 10⁻⁴. A ReduceLROnPlateau learning rate scheduler was employed to adaptively adjust the learning rate based on validation performance, with a reduction factor of 0.5, patience of 5 epochs, and a minimum learning rate of 1 × 10⁻⁶.
The model was trained for 50 epochs with a batch size of 16. Early stopping was not applied, as convergence was consistently observed within the predefined number of epochs.
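The optimizer and scheduler settings listed above map onto PyTorch as sketched below. The single convolution is only a stand-in for SSA-UNet, the constant validation score is a placeholder used to show the scheduler's behaviour when no improvement occurs, and `mode="max"` is an assumption that follows from monitoring the validation F1-score.

```python
import torch

model = torch.nn.Conv2d(5, 1, kernel_size=1)   # stand-in for the SSA-UNet model
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-5
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=5, min_lr=1e-6
)

learning_rates = []
for epoch in range(50):
    # A real loop would train for one epoch here and compute the validation F1-score.
    val_f1 = 0.80                               # placeholder: a plateaued metric
    scheduler.step(val_f1)
    learning_rates.append(optimizer.param_groups[0]["lr"])
```

With a metric that never improves, the learning rate is halved repeatedly until it reaches the configured minimum; in actual training, reductions occur only when the validation score stalls.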
4.1.1. Loss Convergence Behavior
The training process demonstrates a stable and well-converged optimization trajectory throughout the entire training period. As shown in
Figure 6, both the training loss and validation loss decrease consistently with increasing epochs, indicating effective learning of wildfire-related spectral–spatial features by the proposed SSA-UNet model.
Rapid learning stage (Epoch 1–10): The model quickly captures discriminative spectral–spatial patterns of fire hotspots, leading to a sharp decrease in both training and validation losses.
Progressive optimization stage (Epoch 10–30): The loss continues to decline at a slower rate, indicating refinement of fine-grained features such as local texture and contextual contrast.
Convergence stage (Epoch 30–50): The validation loss stabilizes around its minimum (0.2778), with a consistently small gap between training and validation curves, suggesting good generalization and absence of overfitting.
Notably, validation loss is occasionally lower than training loss, which can be attributed to implicit regularization effects such as Batch Normalization and data augmentation.
4.1.2. Metric Evolution
Performance metrics, including Precision, Recall, F1-score, and Intersection over Union (IoU), exhibit consistent improvement throughout the training process. As illustrated in
Figure 7, all evaluation metrics increase steadily with increasing epochs.
During early training, Recall increases rapidly, indicating that the model prioritizes detecting potential fire hotspots. As training progresses, Precision gradually improves, reflecting enhanced suppression of false positives.
In later stages, all metrics stabilize, with F1-score reaching ~0.88 and IoU ~0.77, demonstrating a balanced trade-off between detection sensitivity and reliability.
This dynamic reflects a typical precision–recall trade-off optimization process, where the model evolves from coarse detection to refined discrimination.
4.1.3. Learning Rate Adaptation
The adoption of the ReduceLROnPlateau strategy plays a critical role in stabilizing training. As performance improvement slows in the mid-to-late stages, the learning rate is adaptively reduced, allowing finer parameter updates and preventing oscillations around local minima. This contributes to the smooth convergence observed in the final training phase.
4.2. Quantitative Performance Evaluation
4.2.1. Threshold Optimization
The prediction outputs of the SSA-UNet model are continuous probability values ranging from 0 to 1, where each pixel value represents the estimated likelihood of wildfire hotspot occurrence. Rather than adopting a fixed empirical threshold, this study systematically evaluated model performance under multiple threshold settings to determine the optimal segmentation criterion for fire hotspot extraction. As illustrated in
Figure 8, threshold values ranging from 0.1 to 0.9 with an interval of 0.1 were tested to analyze the sensitivity of the model to different probability cutoffs. Lower threshold values generally produce higher Recall because more potential fire pixels are retained; however, they also increase the risk of false-positive detections caused by background interference. Conversely, higher thresholds improve Precision by suppressing weak and uncertain predictions but may lead to omission of small or low-intensity fire hotspots.
The results reveal a clear trend: as the threshold increases, Precision gradually improves due to the suppression of false positives, while Recall decreases as more low-confidence fire pixels are discarded. Consequently, the F1-score exhibits a unimodal distribution, reaching its maximum at an intermediate threshold.
The optimal balance between the two is achieved at T = 0.40, where the F1-score reaches its maximum.
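The threshold sweep can be sketched as follows. For simplicity the sketch scores thresholds with a pixel-wise F1 on synthetic probabilities, whereas the study evaluates against buffered FIRMS detections; the data and function names are illustrative only, and the selected value here has no relation to the reported optimum.

```python
import numpy as np

def f1_at(prob, truth, threshold):
    """Pixel-wise F1-score of the mask obtained by thresholding `prob`."""
    pred = prob >= threshold
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(prob, truth, thresholds):
    """Return the threshold with the highest F1 together with all scores."""
    scores = {round(float(t), 2): f1_at(prob, truth, t) for t in thresholds}
    return max(scores, key=scores.get), scores

# Synthetic validation data: sparse fire pixels with noisy predicted probabilities.
rng = np.random.default_rng(0)
truth = rng.random(10_000) < 0.05
prob = np.clip(0.25 + 0.5 * truth + rng.normal(0.0, 0.15, size=truth.shape), 0.0, 1.0)

thresholds = np.linspace(0.1, 0.9, 9)   # 0.1, 0.2, ..., 0.9
chosen, scores = best_threshold(prob, truth, thresholds)
```

`np.linspace` is used instead of `np.arange` so that floating-point rounding cannot add or drop a threshold at the end of the range.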
4.2.2. Confusion Matrix Analysis
As illustrated in
Figure 9, among 10,614,832 evaluated pixels, 200,435 fire pixels are correctly detected (TP), while 31,113 pixels are falsely classified as fire (FP), and 29,825 fire pixels are missed (FN).
The relatively low number of false positives demonstrates strong background suppression capability, which is essential in complex environments such as deserts and cloud-contaminated regions. Meanwhile, the remaining false negatives are primarily associated with small-scale or low-intensity fires, highlighting the inherent difficulty of detecting weak thermal signals under moderate spatial resolution.
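For reference, the standard metric definitions can be applied directly to the confusion-matrix counts reported above. The resulting values are derived here by arithmetic and are not quoted from the study; the true-negative count follows from the total number of evaluated pixels.

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall, F1-score, and IoU from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

# Counts as reported for the confusion matrix.
counts = {"tp": 200_435, "fp": 31_113, "fn": 29_825}
total_pixels = 10_614_832
true_negatives = total_pixels - sum(counts.values())

scores = detection_metrics(**counts)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```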
4.2.3. ROC and PR Curve Analysis
As shown in
Figure 10a, the model achieves an Area Under the Receiver Operating Characteristic (ROC) Curve value of 0.998, indicating excellent separability between fire and non-fire classes across different classification thresholds. However, because wildfire hotspot detection is characterized by severe class imbalance, where fire pixels occupy only a very small proportion of the total image, ROC curves alone may overestimate practical detection performance. Therefore, the Precision–Recall (PR) curve was additionally evaluated to provide a more reliable assessment under sparse-target conditions. As illustrated in
Figure 10b, the PR curve achieves an AUC value of 0.946, demonstrating strong robustness in identifying sparse wildfire hotspot targets while maintaining relatively high precision.
4.3. Comparative Evaluation
To comprehensively evaluate the effectiveness of the proposed method, we conducted comparative experiments against both conventional approaches and state-of-the-art deep learning models. To ensure a fair and rigorous comparison, all baseline methods were carefully optimized under a consistent experimental framework.
For traditional methods, key parameters (e.g., SWIR threshold, window size, and sensitivity coefficient) were systematically tuned using grid search on the validation set to maximize their respective F1-scores. This ensures that the performance of conventional approaches reflects their optimal capability rather than suboptimal parameter settings.
For deep learning models, identical training datasets, preprocessing procedures, network input sizes, loss functions, and optimization strategies were adopted.
4.3.1. Experimental Setup
Two traditional methods were selected as baseline approaches:
- (1)
SWIR threshold-based method
Fires typically exhibit significantly high brightness in the shortwave infrared band; therefore, high-temperature pixels can be identified by setting a threshold. This method uses the B13 (1640 nm) band of the HY-1E image for fire discrimination:
$$\mathrm{Fire}(x, y) = \begin{cases} 1, & B_{13}(x, y) > T_{\mathrm{SWIR}} \\ 0, & \text{otherwise} \end{cases}$$
where B13(x,y) is the brightness value of pixel (x,y) in band B13 and T_SWIR is the detection threshold. When a pixel’s brightness value exceeds this threshold, it is classified as a fire pixel. Similar single-band threshold strategies have been widely used in early AVHRR and MODIS fire detection algorithms [
26].
- (2)
Spatial context-based method (SCM)
To reduce the false alarm rate of the single-band threshold method, a spatial context-based approach is adopted. It follows the general framework of contextual fire detection algorithms widely used in satellite-based fire products, particularly those developed for MODIS and AVHRR sensors [
8,
27].
Unlike pixel-wise thresholding, this approach incorporates local neighborhood statistics to evaluate the relative anomaly of a candidate pixel with respect to its surrounding background. For each candidate pixel i, a local window Ωi of size w × w (typically 3 × 3 or 5 × 5) is defined. The local background statistics excluding the central pixel are computed as:
where μi and σi denote the local mean and standard deviation, respectively.
A pixel is classified as fire if it satisfies both a global intensity constraint and a local anomaly condition:
where k is a sensitivity parameter (typically ranging from 2 to 4).
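The background statistics and the two-part decision rule described above can be sketched as follows. This is one plausible formulation, not necessarily the exact one used in the experiments: $\rho_{B13}(j)$ and $F(i)$ are notation assumed here, the window is assumed to lie fully inside the image (so it holds $w^2 - 1$ background pixels), and the population form of the standard deviation is an assumption.

```latex
\mu_i = \frac{1}{w^2 - 1} \sum_{j \in \Omega_i,\, j \neq i} \rho_{B13}(j),
\qquad
\sigma_i = \sqrt{\frac{1}{w^2 - 1} \sum_{j \in \Omega_i,\, j \neq i} \bigl(\rho_{B13}(j) - \mu_i\bigr)^2}

F(i) = 1 \iff \rho_{B13}(i) > T_{SWIR} \;\text{ and }\; \rho_{B13}(i) > \mu_i + k\,\sigma_i
```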
This contextual strategy ensures that detected fire pixels are not only spectrally bright but also significantly different from their local background, thereby effectively reducing commission errors caused by spatially homogeneous bright surfaces.
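To make the contrast between the two traditional baselines concrete, the following minimal Python sketch implements both rules on a single-band image stored as a list of rows. The function names, the clipping of the window at image borders, and the toy pixel values are assumptions made for illustration and do not reproduce the experimental code.

```python
from statistics import fmean, pstdev


def swir_threshold(band, t_swir):
    """Pixel-wise rule: a pixel is fire when its B13 value exceeds t_swir."""
    return [[value > t_swir for value in row] for row in band]


def spatial_context(band, t_swir, w=5, k=3.0):
    """Contextual rule: global threshold AND local anomaly (value > mu + k * sigma)."""
    rows, cols = len(band), len(band[0])
    half = w // 2
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            value = band[r][c]
            if value <= t_swir:  # global intensity constraint not met
                continue
            # Background: window pixels inside the image, excluding the centre.
            # Clipping at the border is an assumption of this sketch.
            background = [
                band[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
                if (rr, cc) != (r, c)
            ]
            if not background:
                continue
            mu, sigma = fmean(background), pstdev(background)
            mask[r][c] = value > mu + k * sigma
    return mask


# Toy scenes (hypothetical values): a homogeneous bright surface and a
# dark scene containing one hot pixel.
bright_surface = [[0.75] * 5 for _ in range(5)]
hot_pixel = [[0.125] * 5 for _ in range(5)]
hot_pixel[2][2] = 0.875

flagged_by_threshold = sum(map(sum, swir_threshold(bright_surface, 0.5)))
flagged_by_context = sum(map(sum, spatial_context(bright_surface, 0.5)))
```

On the homogeneous bright scene, the pixel-wise rule flags every pixel, whereas the contextual rule flags none because no pixel stands out from its neighborhood. The isolated hot pixel in the second scene passes both conditions.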
In addition, two widely used deep learning segmentation models were adopted for comparison: (3) U-Net and (4) DeepLabV3+.
To further investigate the effectiveness of attention mechanisms, an attention-enhanced baseline (CBAM-UNet) was also included.
To ensure a fair comparison, all deep learning models were trained under identical conditions. Specifically, the same training and validation datasets were used, identical preprocessing procedures were applied, the input patch size was fixed at 128 × 128 pixels, the loss function was uniformly set to BCE + Dice loss, the Adam optimizer was used with an initial learning rate of 1 × 10⁻⁴, and all models were trained for 50 epochs with a batch size of 16. This unified experimental protocol ensures that performance differences arise from model design rather than training strategies.
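As an illustration of the shared training objective, the sketch below computes a combined BCE + Dice loss over flattened pixel probabilities in plain Python. The equal weighting of the two terms, the smoothing constant, and the clamping value are assumptions, since the text does not specify them; the configuration dictionary simply restates the settings listed above.

```python
import math

# Shared protocol as stated in the text (dictionary keys are illustrative names).
TRAIN_CONFIG = {
    "patch_size": (128, 128),
    "optimizer": "Adam",
    "learning_rate": 1e-4,
    "epochs": 50,
    "batch_size": 16,
}


def bce_dice_loss(probs, targets, eps=1e-7, smooth=1.0):
    """Mean binary cross-entropy plus soft Dice loss (equal weights assumed)."""
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and equally long")
    bce = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        bce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    bce /= len(probs)
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return bce + (1.0 - dice)


# An uninformative prediction (all 0.5) on a half-fire patch.
uninformative = bce_dice_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The Dice term is commonly paired with BCE in segmentation tasks where the foreground class is rare, which is the case for fire pixels, because it depends on region overlap rather than on the count of background pixels.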
4.3.2. Ablation Study
To quantitatively evaluate the contribution of each attention component, an ablation study was conducted using four network configurations with different combinations of spectral attention and spatial attention modules. As summarized in Table 4, the experiments include the baseline U-Net, U-Net with spectral (channel) attention, U-Net with spatial attention, and the SSA-UNet integrating both attention mechanisms.
The quantitative results presented in Table 5 demonstrate that both attention modules contribute positively to wildfire hotspot detection performance, although their effects differ in terms of feature representation behavior. Compared with the baseline U-Net, the introduction of spectral attention improves Precision. This improvement indicates that spectral attention effectively enhances the selection of fire-sensitive spectral channels, particularly SWIR and thermal infrared bands, thereby suppressing false-positive responses caused by spectrally similar background regions. In contrast, the incorporation of spatial attention produces a relatively larger improvement in Recall. This result suggests that spatial attention enhances the network’s ability to capture contextual spatial dependencies and identify fragmented or weak wildfire hotspots under heterogeneous environmental conditions.
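Because the ablation is read in terms of Precision and Recall, it may help to recall how these metrics, together with F1-score and IoU, follow from pixel-level counts. The sketch below is a generic implementation with a hypothetical function name and toy labels; it is not the evaluation code used in this study.

```python
def segmentation_metrics(pred, truth):
    """Precision, Recall, F1 and IoU from flattened binary masks (1 = fire)."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Toy example: 3 true positives, 1 false positive, 2 false negatives.
pred = [1, 1, 1, 1, 0, 0, 0, 0]
truth = [1, 1, 1, 0, 1, 1, 0, 0]
scores = segmentation_metrics(pred, truth)
```

In this toy case Precision is 0.75 and Recall is 0.60. Fewer false positives raise Precision (the effect attributed to spectral attention), while fewer false negatives raise Recall (the effect attributed to spatial attention). When both are computed from the same counts, IoU and F1 are linked by IoU = F1 / (2 − F1).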
4.3.3. Quantitative Comparison
The quantitative comparison results on the validation dataset are presented in Table 6 and Figure 11.
The results demonstrate that traditional methods achieve relatively high Recall but suffer from low Precision due to their sensitivity to background noise and spectral ambiguity.
Parameter sensitivity analysis of traditional methods. The performance of both traditional methods (SWIR thresholding and the spatial context method) is highly sensitive to parameter settings, namely the threshold T_SWIR, the local window size w, and the sensitivity parameter k. To ensure a fair comparison with deep learning methods, these parameters were optimized on the validation set via grid search to maximize the respective F1-scores. The optimized spatial context method (w = 5, k = 3) raised Precision from 0.5500 to 0.6500 relative to simple SWIR thresholding, an absolute gain of 10 percentage points (approximately 18% in relative terms), confirming that the introduction of local spatial statistics effectively suppresses false positives caused by bright backgrounds. Nevertheless, the performance ceiling of these two traditional methods remains significantly lower than that of deep learning models. This highlights that, on novel sensor data such as HY-1E COCTS2, models relying on fixed thresholds and simple statistical rules struggle to capture the nonlinear, multiscale spectral–spatial patterns that separate fires from complex backgrounds, whereas data-driven deep learning methods possess inherent advantages in this regard.
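The grid search used to tune the traditional baselines can be illustrated with a small, generic sketch. The detector, the candidate threshold values, and the toy validation pixels below are hypothetical; only the procedure (exhaustively scoring every parameter combination by F1 on validation data) follows the description above.

```python
from itertools import product


def f1_score(pred, truth):
    """F1 from flattened boolean masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def grid_search(detector, grid, values, truth):
    """Return (best_params, best_f1) over the Cartesian product of the grid."""
    names = sorted(grid)
    best_params, best_f1 = None, -1.0
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = f1_score(detector(values, **params), truth)
        if score > best_f1:  # keep the first combination reaching the best score
            best_params, best_f1 = params, score
    return best_params, best_f1


def threshold_detector(values, t_swir):
    """Single-band rule applied to a flat list of pixel values."""
    return [value > t_swir for value in values]


# Hypothetical validation pixels and labels.
values = [0.10, 0.20, 0.35, 0.55, 0.60, 0.80, 0.90]
truth = [False, False, False, False, True, True, True]
best, score = grid_search(
    threshold_detector, {"t_swir": [0.3, 0.5, 0.58, 0.7]}, values, truth
)
```

For the spatial context method, the same routine would be run over a grid that also includes `w` and `k`. Because the parameters are selected on the validation set, the resulting validation scores are optimistic for the traditional methods, which works in their favor in the comparison.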
Deep learning models significantly improve overall performance by learning complex spectral–spatial patterns. Among them, DeepLabV3+ achieves strong performance due to its ability to capture multi-scale contextual information.
The proposed SSA-UNet outperforms all comparison methods, achieving the highest F1-score and IoU. Compared with the baseline U-Net, SSA-UNet improves F1-score by 10.1% and IoU by 12.1%, indicating substantial enhancement in both detection accuracy and spatial consistency.
Furthermore, compared with the attention-based CBAM-UNet, SSA-UNet still achieves a notable improvement (+3.6% F1-score), demonstrating the superiority of the proposed spectral–spatial attention design over conventional attention mechanisms.
4.3.4. Visual Assessment
(1) Visual Comparison
Qualitative results further support the quantitative findings. Traditional methods produce scattered false detections in high-reflectance regions, U-Net tends to over-detect background noise, and DeepLabV3+ improves spatial consistency but still misses small fire regions. In contrast, SSA-UNet produces more compact and accurate fire regions with clearer boundaries.
(2) Predictions on Validation Set Samples
To intuitively analyze the model’s predictive capability, four representative samples were randomly selected from the validation set for visualization analysis, as shown in Figure 12.
For samples containing fire, the model could accurately identify fire regions. The predicted probability maps showed high consistency with the ground truth masks, with clear fire boundaries and reasonable spatial distribution. In areas with dense fires, the model effectively distinguished adjacent fire targets, maintaining good spatial resolution. For very small fires (occupying only about a dozen pixels), the model could still generate high prediction probabilities (typically greater than 0.8), indicating strong detection capability for small-target fires.
For samples without fire, prediction probabilities for most background areas were below 0.1, indicating the model’s good discriminative ability for non-fire regions and a low false-positive rate.
4.3.5. Generalization Ability
To evaluate the application potential of the proposed SSA-UNet model in unseen spatiotemporal scenarios, cross-spatiotemporal generalization experiments were designed. As described in Section 2.2.3, the independent test set consists of three images captured in March 2026. All test images postdate the training set, so the experiment measures the model’s performance on future observational data. The test set was not used at any stage of model training or hyperparameter tuning (including decision threshold optimization).
The generalization performance of different models on the independent test set is summarized in Table 7. All compared models exhibited varying degrees of performance degradation (decrease in F1-score) when applied to the unseen 2026 test data. This phenomenon is expected and reflects domain shift issues caused by seasonal variations, differences in surface conditions, and changes in fire burning characteristics.
However, the SSA-UNet model proposed in this study demonstrated the best robustness. Its F1-score decreased from 0.8680 on the validation set to 0.8195 on the test set, with a performance drop significantly lower than that of all comparison models. Specifically, the performance drop of SSA-UNet is approximately 57.7% smaller than that of the baseline U-Net and approximately 42.0% smaller than that of CBAM-UNet, which also incorporates an attention mechanism. These results indicate that SSA-UNet not only achieves the highest performance but also learns feature representations that exhibit stronger invariance and transferability across temporal variations.
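The robustness comparison above rests on two quantities: the absolute F1 drop between validation and test, and the relative reduction of that drop with respect to a reference model. The sketch below reproduces this arithmetic from the figures quoted in the text. The reference drops for U-Net and CBAM-UNet are back-calculated from the stated percentages and are therefore implied values, not numbers taken from Table 7; the function names are illustrative.

```python
def f1_drop(val_f1, test_f1):
    """Absolute F1 decrease from validation to test."""
    return val_f1 - test_f1


def relative_reduction(drop_model, drop_reference):
    """Fraction by which drop_model is smaller than drop_reference."""
    return 1.0 - drop_model / drop_reference


# Reported SSA-UNet scores.
ssa_drop = f1_drop(0.8680, 0.8195)  # about 0.0485 in absolute terms

# Implied reference drops, derived from the stated 57.7% and 42.0% reductions.
# These are assumptions of this sketch, not values read from Table 7.
implied_unet_drop = ssa_drop / (1.0 - 0.577)
implied_cbam_drop = ssa_drop / (1.0 - 0.420)
```

Under these assumptions, the stated percentages correspond to F1 drops of roughly 0.115 for U-Net and 0.084 for CBAM-UNet, compared with 0.0485 for SSA-UNet.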
4.4. Discussion
4.4.1. Error and Limitation Analysis
To further understand the limitations of the proposed model, a detailed error analysis is conducted by examining typical false positive (FP) and false negative (FN) cases.
(1) False-Negative Analysis (Missed Detections)
False negatives mainly occur in several challenging scenarios, as illustrated in Figure 13.
First, extremely small-scale fires occupying fewer than approximately 10 pixels are more likely to be missed because their spectral and spatial signatures are weak relative to the surrounding background. During convolution and pooling operations, these subtle fire features may become diluted or suppressed.
Second, low-intensity or smoldering fires often exhibit relatively weak thermal anomalies in the SWIR and TIR bands, resulting in insufficient spectral contrast between fire pixels and surrounding surfaces. Such fires are therefore difficult for the model to distinguish from background noise.
Third, complex environmental conditions, including cloud shadows, water-adjacent regions, and heterogeneous land surfaces, may reduce radiative contrast and interfere with fire feature extraction, thereby increasing omission errors.
(2) False-Positive Analysis (Commission Errors)
False positives are primarily associated with several types of high-temperature or high-reflectance background surfaces, as shown in Figure 14.
Typical false alarm sources include industrial areas, sunlit bare soil, desert surfaces, and cloud edges. These targets may exhibit spectral characteristics partially similar to wildfire signals, particularly in the SWIR bands, leading to misclassification by the network.
In addition, thin cirrus clouds and strong surface reflections occasionally generate localized thermal anomalies that resemble weak fire signals, further increasing the probability of commission errors under complex atmospheric conditions.
(3) Limitations: Scale Effect and Small-Target Omission
To quantitatively evaluate the influence of fire size on detection performance, fire regions were grouped according to their pixel area, and the corresponding F1-scores were calculated for each size category.
As shown in Figure 15, the detection performance exhibits a clear scale-dependent characteristic. For fire targets smaller than 10 pixels, the F1-score decreases substantially, indicating limited detection capability for extremely small fires. Detection performance improves progressively for targets ranging from 10 to 50 pixels and becomes relatively stable for larger fire regions exceeding 50 pixels.
Statistical analysis further indicates that small fires (<10 pixels) account for more than 47% of all false-negative cases. This result confirms that small-scale fire omission remains one of the primary limitations of the current model. The main reason for this phenomenon is the feature dilution effect. Extremely small fire targets occupy only a very small proportion of the input image patch (approximately 0.01–0.1%), causing their spectral–spatial signatures to be weakened during repeated convolution and downsampling operations.
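For scale, a 128 × 128 patch contains 16,384 pixels, so a 10-pixel fire covers about 0.06% of the patch. The size-stratified analysis itself can be sketched as follows: reference fire regions are extracted as connected components, binned by pixel area, and scored per bin. The use of 8-connectivity, the handling of the bin edges, and the choice of pixel-level recall as the per-bin score are assumptions of this sketch; the text does not specify how the per-category F1-scores were computed.

```python
from collections import deque


def label_regions(mask):
    """Return connected fire regions as lists of (row, col), 8-connectivity."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                region.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            regions.append(region)
    return regions


def size_bin(area):
    """Size categories from the text; bin edges are an assumption."""
    if area < 10:
        return "<10"
    if area <= 50:
        return "10-50"
    return ">50"


def recall_by_size(truth, pred):
    """Pixel-level recall of reference fire regions, grouped by region area."""
    hits, totals = {}, {}
    for region in label_regions(truth):
        key = size_bin(len(region))
        totals[key] = totals.get(key, 0) + len(region)
        hits[key] = hits.get(key, 0) + sum(1 for y, x in region if pred[y][x])
    return {key: hits[key] / totals[key] for key in totals}


# Fraction of a 128 x 128 patch covered by a 10-pixel fire, in percent.
small_fire_percent = 100.0 * 10 / (128 * 128)
```

A per-bin breakdown of this kind makes it possible to attribute omission errors to size categories, which is the basis for statements such as the share of false negatives caused by fires smaller than 10 pixels.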
4.4.2. Case Study
To further evaluate the practical applicability of the proposed SSA-UNet framework under real wildfire conditions, a representative wildfire event that occurred on 4 April 2025 near the boundary between Qinyuan County and Pingyao County in Shanxi Province was selected as a case study.
The wildfire developed rapidly under strong wind conditions, resulting in fast fire spread across mountainous terrain and heterogeneous land-cover environments. Such conditions present considerable challenges for remote sensing-based wildfire hotspot detection because strong winds can cause fragmented fire fronts, rapid spatial expansion, and smoke interference, while mountainous terrain introduces substantial background heterogeneity and terrain-shadow effects.
The HY-1E COCTS2 image used in this case study was acquired at approximately 10:00 on 5 April 2025, nearly one day after the wildfire outbreak. At the time of satellite observation, active flame regions were difficult to visually identify in the optical imagery because of smoke coverage, reduced flame intensity, coarse spatial resolution, and complex mountainous terrain. As shown in Figure 16a, the fire-affected region is barely distinguishable from the surrounding background in the visible bands of the 500 m resolution COCTS2 optical imagery. As illustrated in Figure 16b, thermally anomalous regions remain observable despite the weak visual response in the optical observations.
To further verify the reliability of the detected wildfire hotspots, a 2 m spatial-resolution image acquired over the same region on 5 April 2025 was additionally analyzed. As shown in Figure 16c, clear burn scars and fire-affected surface features can be visually identified in the high-resolution imagery, despite being difficult to recognize in the 500 m optical observations from HY-1E COCTS2. The spatial consistency between the high-resolution burn features and the wildfire hotspots detected by SSA-UNet further demonstrates the effectiveness of the proposed framework for identifying residual wildfire activity under coarse-resolution satellite observations.
4.4.3. Practical Limitations
Although the proposed SSA-UNet achieved strong overall performance, several practical limitations remain.
Cloud contamination, smoke coverage, and atmospheric interference may obscure thermal anomalies and reduce detection reliability under complex weather conditions. In addition, the current study is based on a relatively limited number of wildfire scenes. Although extensive patch extraction and data augmentation were employed, additional multi-season and multi-regional datasets are still required to further evaluate model robustness and transferability.
Another limitation arises from the use of FIRMS products as reference labels instead of true ground observations. Since MODIS and VIIRS fire products themselves contain spatial uncertainty, temporal mismatch, and omission errors, the evaluation results should be interpreted as relative consistency with existing operational fire products rather than absolute detection accuracy.
Moreover, rapid wildfire evolution and satellite revisit limitations may introduce temporal observation gaps under extreme weather conditions. Deep learning inference over large-scale remote sensing imagery also requires considerable computational resources, which may limit operational deployment efficiency in near-real-time wildfire monitoring systems.
5. Conclusions and Future Work
5.1. Conclusions
This study develops a deep learning-based framework for fire hotspot detection using multispectral observations from the COCTS2 sensor onboard the HY-1E satellite, targeting large-scale wildfire monitoring over the Three-North region of China. The results demonstrate that HY-1E COCTS2 multispectral data, when coupled with deep learning, provide a viable solution for large-scale fire hotspot detection under complex environmental conditions. Accurate and timely wildfire hotspot detection is essential for reducing ecological degradation, protecting forest and grassland resources, mitigating carbon emissions caused by wildfires, and enhancing regional resilience in the context of global climate change, thereby contributing to sustainable development goals related to climate action and ecosystem conservation.
The main conclusions of this study are summarized as follows.
(1) Feasibility of HY-1E COCTS2 for wildfire hotspot detection
The multispectral configuration of HY-1E COCTS2, particularly the NIR, SWIR, and TIR bands, exhibits strong sensitivity to thermal anomalies associated with wildfire activity. The results confirm that COCTS2 can effectively detect moderate-to-high-intensity wildfire hotspots and provides a valuable supplementary data source for large-scale land-based wildfire monitoring applications. The capability of continuous and large-scale wildfire observation further supports sustainable management of ecological resources.
(2) Effectiveness of the proposed SSA-UNet framework
The proposed SSA-UNet framework effectively learns nonlinear spectral–spatial representations through end-to-end feature extraction and attention-based optimization, significantly outperforming conventional threshold-based approaches. The improvement is particularly evident in complex environments, where traditional methods suffer from spectral ambiguity and high false-alarm rates. By improving detection accuracy and reducing uncertainty, the proposed framework enhances the reliability of satellite-based wildfire monitoring systems and provides technical support for sustainable disaster prevention and emergency response.
(3) Practical applicability
The Shanxi Qinyuan–Pingyao wildfire case study verified the practical applicability of SSA-UNet for residual wildfire hotspot monitoring under real-world conditions. Even when wildfire signatures were difficult to visually identify in coarse-resolution optical imagery, the proposed framework could still successfully identify thermally anomalous regions associated with residual wildfire activity. This capability is beneficial for post-fire management, ecological restoration, and long-term sustainable land management, especially in ecologically fragile regions.
5.2. Future Work
Although the proposed framework achieved promising results, several challenges still require further investigation.
First, extremely small-scale and low-intensity fires remain difficult to detect because their spectral signatures are easily weakened during convolution and downsampling operations. Future studies should therefore explore more effective multi-scale feature extraction strategies, such as feature pyramid networks and adaptive attention mechanisms, to improve sensitivity to small wildfire hotspots.
Second, the current study was conducted using a relatively limited number of wildfire scenes from the Three-North region of China. Expanding high-quality annotated datasets and conducting cross-regional experiments in different ecological environments will be important for improving model robustness, transferability, and operational reliability.
Third, integrating multi-source satellite observations, including MODIS, VIIRS, and higher-resolution remote sensing imagery, may further improve temporal continuity, detection reliability, and monitoring efficiency. Future work will also investigate lightweight network architectures and near-real-time processing strategies to enhance the operational applicability of deep learning-based wildfire monitoring systems.