Article

LinU-Mamba: Visual Mamba U-Net with Linear Attention to Predict Wildfire Spread

by
Henintsoa S. Andrianarivony
and
Moulay A. Akhloufi
*
Perception, Robotics, and Intelligent Machines (PRIME), Department of Computer Science, Université de Moncton, Moncton, NB E1A3E9, Canada
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(15), 2715; https://doi.org/10.3390/rs17152715
Submission received: 9 June 2025 / Revised: 30 July 2025 / Accepted: 4 August 2025 / Published: 6 August 2025

Abstract

Wildfires have become increasingly frequent and intense due to climate change, posing severe threats to ecosystems, infrastructure, and human lives. As a result, accurate wildfire spread prediction is critical for effective risk mitigation, resource allocation, and decision making in disaster management. In this study, we develop a deep learning model to predict wildfire spread using remote sensing data. We propose LinU-Mamba, a model with a U-Net-based vision Mamba architecture, with light spatial attention in skip connections and an efficient linear attention mechanism in the encoder and decoder to better capture salient fire information in the dataset. The model is trained and evaluated on the two-dimensional remote sensing dataset Next Day Wildfire Spread (NDWS), which maps fire data across the United States with fire occurrence, topography, vegetation, weather, drought index, and population density variables. The results demonstrate that our approach achieves superior performance compared to existing deep learning methods applied to the same dataset, while requiring less training time. Furthermore, we highlight the impacts of pre-training and feature selection in remote sensing, as well as the impact of linear attention use in our model. To the best of our knowledge, LinU-Mamba is the first Mamba-based model used for wildfire spread prediction, making it a strong foundation for future research.

1. Introduction

Wildfires are a serious global hazard, and they have become more frequent and intense due to climate change. In 2023, wildfires burned an estimated 36.6 million hectares of forest globally, the largest annual area affected by fires since satellite records began in 2001 [1]. Canada was hit especially hard in 2023, with its most destructive fire season on record and fires burning 15 million hectares across the country [2]. In 2024 alone, roughly 5 million hectares burned in Canada [3]. Occurring naturally or resulting from human activity, wildfires threaten ecosystems and public safety and cause billions of dollars in infrastructure damage each year. In 2024, wildfire damage costs in the United States had reached 1.8 billion USD by the end of December [4].
Modern Earth observation platforms, such as geostationary imagers and polar-orbiting sensors, provide valuable data on active fires and their driving factors. Polar-orbiting sensors such as the Moderate Resolution Imaging Spectroradiometer (MODIS) [5] and the Visible Infrared Imaging Radiometer Suite (VIIRS) [6] provide thermal anomaly products that locate active fire fronts worldwide. Geostationary imagers like the Advanced Baseline Imager (ABI) [7] provide multispectral scenes every 5 to 10 min, delivering near-continuous monitoring of fire evolution, smoke transport, and meteorological fields. Higher-resolution multispectral instruments, such as the Multispectral Instrument (MSI) [8] on Sentinel-2, supply vegetation indices. Together, these data enable the evaluation and prediction of fire evolution. Accurate spread forecasts can improve mitigation strategies, resource allocation, and the protection of ecosystems and human lives. Therefore, it is crucial to develop reliable wildfire spread prediction models to help fire management authorities, as well as to better study and understand the dynamics behind wildfires.
To predict the spread of wildfires, various deep learning architectures have been explored. Convolutional neural network (CNN) families such as FireCast [9] have been widely used in the field. To integrate temporal evolution, convolutional recurrent models like the one presented in [10] are also used, predicting fire perimeters after 24 h. Ref. [11] presented a vision-Transformer-based approach, where the authors integrate spatial attention and focal modulation into the Swin U-net architecture to predict wildfire spread across North America. Finally, exploratory work used decision-centered reinforcement learning [12] and graph neural networks that trace explicit spread paths [13].
Deep-learning-based approaches still face limitations in the task of wildfire spread forecasting. Noisy labels caused by smoke, clouds, and delayed perimeter mapping obscure the fire edges that the models learn from, resulting in prediction errors. Moreover, mismatched resolutions across satellite imagery, weather grids, and fuel data force aggressive resampling that can blur the fine-grained cues driving fire spread. Lastly, the most accurate spatio-temporal architectures remain memory- and GPU-hungry, demanding costly hardware or heavy model pruning.
In this context, to predict wildfire spread over a one-day horizon, we present LinU-Mamba, a Vision Mamba-based U-Net with spatial attention heads in skip connections and an efficient linear attention mechanism in the encoder and decoder. LinU-Mamba was trained on the Next Day Wildfire Spread (NDWS) dataset [14]. Our main contributions are as follows:
  • LinU-Mamba shows superior performance compared to existing deep learning methods using the same dataset, while presenting a more efficient training time.
  • We explore the impact of pre-training and feature selection when using remote sensing data and demonstrate the importance of using linear attention in vision Mamba-based U-Net architectures.
  • To the best of our knowledge, LinU-Mamba is the first model based on state-space mechanisms in the field of wildfire spread prediction, making it a valuable contribution and opening up potential new directions for future research in this area.
The remainder of this paper is structured as follows. Section 2 reviews deep learning techniques used in wildfire spread prediction, as well as vision Mamba-based U-Net models. Section 3 presents the architecture of LinU-Mamba. Section 4 reports the dataset used and experimental setup for the training of LinU-Mamba. Section 5 reports the performance of the model and ablation studies. Section 6 discusses the results of our experiments. Section 7 concludes the paper.

2. Related Work

2.1. Wildfire Spread Prediction Techniques

Early wildfire spread prediction relied on physics and simple rules rather than data. The Canadian Forest Fire Behavior Prediction System (FBP) [15] couples look-up tables of fuel, weather, and topography to empirical rate-of-spread formulas. In the United States, FARSITE [16] integrates the Rothermel [17] surface fire equation inside a simulator and remains a standard for operational forecasting. Prometheus [18] adapts the same mechanics to Canadian landscapes. Such simulators are transparent, but demand finely prepared inputs and heavy computation. Alternative empirical-based techniques, for example, the cellular-automaton model that reproduced the 1990 Spetses Island burn [19], simplify physics by applying ignition rules on a grid. Those classical approaches struggle to assimilate multimodal satellite data and to generalize across ecosystems.
Data-driven models first appeared as hybrids. Zheng et al. [20] embedded an Extreme Learning Machine inside a cellular automaton and reproduced five western American fires with 82% accuracy, while Xu et al. [21] developed LSSVM-CA and attained a 97.9% perimeter overlap on fires in Sichuan forests. Since then, researchers have turned to machine learning models. On tabular datasets, Khanmohammadi et al. [22] tested several machine learning models, including linear SVMs, boosted trees, and Gaussian processes, and demonstrated that they can reach very low error on Australian grassland burns. Transparent Open Box [23] utilizes a two-stage machine learning data matching technique to extract data mining insights: it begins with an initial data matching step, followed by an optimization process that leverages variable weights and adjustable best match counts. Decision tree and XGBoost regressors [24] achieved a pixel-level root mean square error of 0.15 on the remote sensing NDWS benchmark [14]. Rubí et al. [25] predicted wildfire spread and behavior at a specific time in the Brazilian Federal District region, employing a dataset of that region with multimodal features such as the fire ignition point, vegetation, climatic, hydrographic, and anthropogenic factors. They tested different machine learning models, but ultimately Adaboost gave the best performance, with an accuracy of 92.3% on the predicted area. Machine learning methods show better performance than classical models. However, handling visual input and extensive datasets poses a significant challenge for many machine learning models.
Deep learning models excel in wildfire spread prediction because they can capture complex interactions between different variables and handle extensive datasets. Different architectures and approaches have been explored. The use of convolutional neural networks is widespread in the field because of their ability to learn spatial features from satellite imagery. FireCast [9] already surpasses FARSITE [16] on 24 h forecasts while running on modest hardware, and a multikernel CNN [26] obtains an accuracy of 98.6% on a subset of the NDWS dataset [14]. U-Net-based models have also shown efficacy in this domain. Khennou et al. [27] introduced a deep U-Net architecture called FU-NetCast that uses past fire perimeter maps, satellite data, digital elevation models, and weather features to forecast the next-day burned area, achieving about 92% accuracy in predicting 24 h fire progression. When temporal evolution is critical, combinations of CNNs and recurrent neural networks are used, such as the ConvLSTM with self-attention [10] that achieved an F1-score of 96% on the California wildfires dataset [28]. CNN-BiLSTM [29] is a hybrid architecture combining CNN and long short-term memory (LSTM) modules to make near real-time wildfire spread predictions using VIIRS active fires with different environmental variables; its authors reached an F1-score of 64% on fire data from Australia. Transformers and attention-based models are also used in the field of wildfire spread prediction. The ASUFM Swin U-Net [11] shows strong performance on the NDWS benchmark dataset [14]. Shadrin et al. [30] employed a Multi-Attention U-Net to integrate satellite imagery with climate variables for 1-to-5-day spread forecasting, reporting F1-scores of between 64% and 68% for predicting burned areas at these longer lead times. Reinforcement-learning surrogates that fuse Monte Carlo Tree Search with A3C reach 92% accuracy on Alberta fire cases [12], and graph neural networks such as the Irregular Graph Network explicitly propagate fire along fuel-slope edges with roughly 80% accuracy on the Getty Fire [13]. Collectively, these deep models deliver the highest performance reported in the literature, but disparate datasets and metrics still hinder fair comparison, motivating calls for open benchmarks and more explainable, real-time architectures to bridge research and operations.
As large language models (LLMs) become increasingly prevalent, Ramesh et al. [31] compared the performance of a specialized predictive model based on TabNet with WildfireGPT. They demonstrated that the TabNet-based model, trained on variables such as vapor pressure deficit, temperature, pressure, and the Fire Weather Index, delivers high correlation with observed NASA Fire Radiative Power (FRP) and lower mean absolute error and mean squared error than the large language model chatbot WildfireGPT. WildfireGPT’s retrieval-augmented prompts produce inconsistent and often inaccurate FRP forecasts, exposing the risk of relying on general-purpose conversational AI for quantitative wildfire spread prediction. The authors conclude that operational wildfire management should favor domain-specific models that ingest real-time data and treat LLMs only as supplementary tools, not primary forecasters.

2.2. State-Space Models, Mamba and Vision Mamba

State-Space Models (SSMs) have emerged as a promising alternative to Transformers for long sequence modeling. The Structured State-Space Sequence model (S4) [32] introduced a new parameterization of linear state-space layers that enabled efficient learning of long-range dependencies, achieving state-of-the-art results on benchmarks like the Long Range Arena and even matching 2D CNNs on sequential image tasks. Building on S4, subsequent SSM research addressed specific domains. For instance, SaShiMi applied S4 to raw audio generation, improving the stability of autoregressive SSMs and outperforming prior architectures like WaveNet on waveform modeling tasks [33]. Meanwhile, other efficient sequence models were developed in parallel. Notably, Mega (Moving Average Gated Attention) [34] introduced a single-head attention variant with a built-in inductive bias for locality, achieving linear-time complexity via chunking and outperforming both Transformers and prior SSM-based models on long-range benchmarks. These advances set the stage for Mamba [35], a selective state-space architecture designed to rival Transformers on general sequence tasks. Mamba [35] introduces input-dependent SSM parameters that enable content-based sequence reasoning, combined with a hardware-efficient parallel scanning algorithm to retain linear time complexity. This design yields a simplified model that achieves five times higher inference throughput than Transformers and scales to sequences of millions of steps, all while attaining state-of-the-art performance across modalities, including language, audio, and genomics. A recent extension, Hydra [36], further demonstrates the power of SSMs by formulating a bidirectional Mamba that excels on non-causal tasks. As a drop-in attention replacement, Hydra outperforms BERT on the GLUE benchmark and even boosts a ViT’s ImageNet accuracy by 2%.
Transformers have also dominated vision tasks, but their self-attention entails quadratic cost for high-resolution imagery. Vision Mamba, or Visual Mamba, was introduced to bring the efficiency of SSMs to visual recognition. Early attempts to apply SSMs to images and videos, such as [37], converted images into 1D sequences. However, naively flattening 2D data breaks local spatial structure, limiting performance on vision tasks [38]. Vision Mamba (Vim) [39] addresses this by incorporating bidirectional SSM layers to capture context along both spatial dimensions, and by adding position embeddings for spatial awareness. This pure-SSM backbone processes image patches as sequences, providing each location with data-dependent global context like attention but with subquadratic complexity. As a result, Vision Mamba achieves competitive or superior accuracy to comparable vision Transformers on ImageNet classification, while being roughly 2.8 times faster and using an order of magnitude less memory for high-resolution inputs. More importantly, the bidirectional state-space design enables dense prediction tasks like segmentation and detection. Following this, researchers have improved visual SSMs by better preserving 2D structure. Spatial-Mamba [38] presents a dilated convolution-based state fusion that models local neighborhoods in the state space, rather than relying on raster scan ordering. This approach unifies the Mamba update and linear attention under one framework and significantly enhances the flow of spatial information. Empirically, Spatial-Mamba achieves state-of-the-art results among SSM-based vision models on image classification, semantic segmentation, and object detection, closing the gap with Transformer-based backbones. Others have explored optimizing the scanning patterns in Vision Mamba. For instance, in LocalMamba [40], a windowed scanning strategy was shown to preserve local 2D dependencies and boost ImageNet accuracy over naive full-sequence scans. In summary, Mamba and its visual derivatives demonstrate a new paradigm of efficient sequence modeling that rivals or surpasses attention-based models in accuracy while dramatically improving scalability. These works collectively highlight the potential of SSMs to serve as powerful backbones for vision tasks such as segmentation, offering global receptive fields and linear complexity without the overhead of attention.

2.3. Vision Mamba-Based U-Net Models

Progress in SSMs and Visual Mamba architectures has led to SSM-based modules being integrated into CNN segmentation architectures. Vision Mamba U-Net (VM-UNet) [41] was one of the first U-shaped encoder–decoders built on Mamba blocks to capture long-range contextual information, using an asymmetrical design to reduce complexity while achieving competitive performance on multiple medical segmentation benchmarks. Likewise, U-Mamba [42] enhanced a U-Net by inserting selective state-space layers, enabling more effective long-range dependency modeling in 2D and 3D biomedical image segmentation than vanilla CNNs. The high-order VMamba U-Net, or H-vmunet [43], further extended the VMamba approach by incorporating high-order 2D selective-scan blocks into the U-Net architecture to reduce redundant information and strengthen long-range feature learning. Researchers have also explored hybrid designs. Swin-UMamba [44] combined a Mamba-based encoder–decoder with a hierarchical Swin Transformer backbone to capture both fine local details and global context in medical images. Swin-UMamba outperformed conventional CNNs, pure Vision Transformers, and prior Mamba-based methods on several benchmarks, with pre-training providing a boost in accuracy. Another variant, VMKLA-UNet [45], introduced a bidirectional Mamba backbone augmented by a decoder with KAN linear attention and channel-spatial attention mechanisms, which achieved state-of-the-art segmentation accuracy across diverse medical imaging datasets. Other recent works have proposed additional enhancements. For instance, VMAXL-UNet [46] integrated lightweight LSTM gating into a VMamba U-Net to further improve long-range context fusion. Beyond the medical domain, Mamba-based U-Nets have demonstrated versatility in other fields. Mamba-Adaptor [47] focused on adapting and scaling the VMamba backbone for general vision tasks. In geospatial applications, 3D-UMamba [48] applied Mamba-based U-Net modeling to 3D LiDAR point cloud segmentation, marking one of the earliest uses of selective state-space layers in remote sensing data. Similarly, a Visual Mamba U-Net with multiscale attention has been used for high-resolution agricultural image segmentation to identify defective crop regions [49], showcasing that Vision Mamba U-Net architectures can efficiently capture long-range dependencies across a wide range of segmentation tasks.

3. Methodology

The wildfire spread prediction task is approached as a segmentation task aiming to predict the next day’s fire mask given multimodal input and the previous day’s fire mask. We present LinU-Mamba to perform this task.

3.1. Network Architecture

3.1.1. Overall Description

LinU-Mamba is a U-Net encoder-decoder style model based on state-space layers to represent local texture and long-range dependencies. Figure 1 depicts the overall architecture. The input is first processed by a patch embedding via a strided convolution before being fed to the encoder. The encoder structure is based on VSS Blocks and downsampling layers to extract features. The decoder mirrors the encoder structure, with VSS Blocks and upsampling layers. Moreover, the encoder and decoder integrate linear attention modules to capture global context. At each stage of the decoder, gated skip connections with tiny attention heads help the model learn fire line details. A final expanding layer is applied to restore the resolution of the segmentation mask.

3.1.2. Encoder

The input $I \in \mathbb{R}^{H \times W \times 3}$ is first processed by a Patch Embedding module, which slices the input into $4 \times 4$ patches and increases the channel count to $C$, which is 96 by default. The output feature map of dimension $H/4 \times W/4 \times C$ is passed through a normalization layer before being fed to the four stages of the encoder. The encoder stages are composed of VSS Blocks, linear attention layers, and Patch Merging modules, except for the last stage, which has no Patch Merging. The VSS Block extracts feature information, while the Patch Merging module halves the spatial size of the feature maps and doubles the channel count. Feature maps have a resolution of $H/8 \times W/8 \times 2C$ after the first stage, $H/16 \times W/16 \times 4C$ after the second stage, and $H/32 \times W/32 \times 8C$ after the third stage, and they keep that dimension after the last stage.

3.1.3. Decoder

The decoder portion of the model mirrors the four stages of the encoder, with VSS Blocks and a Patch Expanding module at each stage except the first one. The Patch Expanding module doubles the resolution of the feature maps and halves the channel dimension. At each upsampling stage, gated skip connections use lightweight spatial attention to generate a mask over encoder features before fusing them into the decoder. The feature maps are then fed to the VSS Blocks before being processed by a linear attention block. Feature maps have a resolution of $H/32 \times W/32 \times 8C$ after the first stage, $H/16 \times W/16 \times 4C$ after the second stage, $H/8 \times W/8 \times 2C$ after the third stage, and $H/4 \times W/4 \times C$ after the fourth. To produce a segmentation mask, a final projection layer reduces the dimensionality of the feature maps.
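For reference, the short sketch below traces the feature-map shapes through the encoder and decoder stages for a 64 × 64 input with the default embedding dimension C = 96. It is a reading aid based on the descriptions above, not part of the model code.

```python
# Illustrative trace of the LinU-Mamba encoder/decoder feature-map shapes
# for a 64 x 64 input and embedding dimension C = 96 (values from the text).
H, W, C = 64, 64, 96

encoder_out = [
    (H // 8,  W // 8,  2 * C),   # stage 1: VSS blocks + patch merging
    (H // 16, W // 16, 4 * C),   # stage 2
    (H // 32, W // 32, 8 * C),   # stage 3
    (H // 32, W // 32, 8 * C),   # stage 4: no patch merging
]
decoder_out = [
    (H // 32, W // 32, 8 * C),   # stage 1: no patch expanding
    (H // 16, W // 16, 4 * C),   # stage 2
    (H // 8,  W // 8,  2 * C),   # stage 3
    (H // 4,  W // 4,  C),       # stage 4
]

for i, (e, d) in enumerate(zip(encoder_out, decoder_out), start=1):
    print(f"stage {i}: encoder {e}, decoder {d}")
```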

3.2. Building Blocks

3.2.1. Linear Attention

Refs. [45,50] successfully integrated linear attention into Mamba-based models. Inspired by those studies, to enhance the ability of U-Net Mamba to highlight salient regions of interest and learn detailed segmentation boundaries, we decided to implement linear attention into our U-Net-like architecture. The linear attention incorporated in LinU-Mamba is derived from efficient attention [51]. A depiction of the linear attention block is presented in Figure 2.
First, the linear attention block uses a single linear layer to project the input feature map into concatenated queries, keys, and values, which are then split across multiple smaller heads. Next, keys are softmax-normalized over their feature channels, and queries are softmax-normalized over spatial positions. All value vectors are then weighted and summed by their key scores to produce a compact context summary per head. That summary is weighted and broadcast to every spatial position using the query scores. The per-head outputs are finally concatenated and passed through a final linear projection, and a residual connection adds the original input back in. This replaces the full N × N similarity matrix, where N is the number of spatial positions in the feature map, with one global aggregation and one distribution step, yielding true global attention in linear time.
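To make the computation concrete, the following is a minimal PyTorch sketch of such a linear attention block, written from the description above rather than from any released code. The tensor layout (batch, positions, channels), the head count, and the exact normalization axes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LinearAttention(nn.Module):
    """Minimal sketch of the linear attention block described above.

    Assumes inputs of shape (B, N, C) with N = H*W flattened spatial
    positions. Illustrative only, not the authors' implementation.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)   # single projection for Q, K, V
        self.proj = nn.Linear(dim, dim)      # final output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)       # each (B, heads, N, d)

        k = k.softmax(dim=-1)                      # keys over feature channels
        q = q.softmax(dim=-2)                      # queries over spatial positions

        context = k.transpose(-2, -1) @ v          # (B, heads, d, d) global summary
        out = q @ context                          # broadcast summary to positions
        out = out.transpose(1, 2).reshape(B, N, C) # merge heads
        return x + self.proj(out)                  # residual connection


# Usage sketch: 4 heads over a 96-channel feature map of 16x16 = 256 positions.
attn = LinearAttention(dim=96, num_heads=4)
y = attn(torch.randn(2, 256, 96))
```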

3.2.2. State-Space Models

SSM-based architectures, such as Mamba [35], approach sequence processing as a continuous-time dynamical system. In Mamba [35], this system is implemented in the core module S6. Given a one-dimensional input signal $x(t)$, the output $y(t) \in \mathbb{R}$ is obtained through a hidden state $h(t) \in \mathbb{R}^{N}$ and can be expressed as a linear ordinary differential equation (ODE):
$$\frac{d}{dt} h(t) = A\, h(t) + B\, x(t), \qquad y(t) = C\, h(t), \tag{1}$$
where $A \in \mathbb{R}^{N \times N}$ denotes the state matrix, and $B \in \mathbb{R}^{N \times 1}$ and $C \in \mathbb{R}^{1 \times N}$ are the weighting parameters.
To be usable in deep learning, the previous equation can be discretized. Sampling with timestep $\Delta$ and using a zero-order hold yields the discrete matrices
$$\bar{A} = \exp(\Delta A), \tag{2}$$
$$\bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right) \Delta B, \tag{3}$$
so that for integer steps $k$:
$$h[k] = \bar{A}\, h[k-1] + \bar{B}\, x[k], \qquad y[k] = C\, h[k]. \tag{4}$$
Two equivalent approaches can be adopted to implement the discretized system. A step-by-step recurrence consists of iterating (4) for each $k$. Alternatively, a global convolution precomputes a length-$L$ kernel $K$ and then computes the output sequence $y$:
$$K = \left(C\bar{B},\; C\bar{A}\bar{B},\; \ldots,\; C\bar{A}^{L-1}\bar{B}\right) \in \mathbb{R}^{L}, \tag{5}$$
$$y = x \ast K. \tag{6}$$
In the continuous-time formulation, the input signal $x(t)$ drives the evolution of an $N$-dimensional hidden state $h(t)$ through the linear ordinary differential Equation (1), where the state matrix $A$ encodes how each component of the hidden state interacts with and influences the others, and the input matrix $B$ determines how a unit of input perturbs each dimension of $h$. The scalar output $y(t)$ is then obtained by applying the read-out row vector $C$ to the hidden state, $y(t) = C\, h(t)$. To implement this in discrete time with step size $\Delta$, one computes the matrix exponential $\bar{A}$ in (2), which describes the hidden-state transition in the absence of new input, and the discrete input matrix $\bar{B}$ in (3), which integrates the continuous drive $B\, x(t)$ over each interval of length $\Delta$.
At discrete steps $k$, the state update becomes (4), and the output is $y[k] = C\, h[k]$. Finally, one can unroll these recurrences into a finite impulse-response kernel $K$ of length $L$ (5), so that the entire output sequence is obtained by the convolution $y = x \ast K$.
State-space models provide an interpretable and mathematically principled way to capture how past inputs influence current outputs via a compact hidden state that evolves according to simple linear dynamics. In Mamba, this continuous-time viewpoint is exploited to build sequence-modeling modules with long-range memory whose parameters $A$, $B$, and $C$ are learned end to end, and whose behavior remains stable and well understood through the matrix exponential discretization. By converting the learned dynamics into a fixed convolutional kernel, Mamba can efficiently compute sequence outputs in parallel while still retaining the recursive, impulse-response character of the underlying system. This approach combines the strong theoretical foundations of linear dynamical systems with the practical speed of convolutional implementations.
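The equivalence between the recurrence (4) and the convolution (5)-(6) can be checked numerically. The NumPy/SciPy sketch below discretizes a small random system with a zero-order hold and verifies that both formulations give the same output; the matrices are arbitrary placeholders, not learned Mamba parameters.

```python
import numpy as np
from scipy.linalg import expm

# Zero-order-hold discretization (Eqs. (2)-(3)) and the two equivalent ways
# of running the discrete system (Eqs. (4)-(6)), on a toy random SSM.
rng = np.random.default_rng(0)
N, L, delta = 4, 16, 0.1

A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))   # state matrix (kept stable)
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
x = rng.standard_normal(L)                            # 1D input signal

A_bar = expm(delta * A)                                               # Eq. (2)
B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(N)) @ (delta * B)  # Eq. (3)

# (a) step-by-step recurrence, Eq. (4)
h = np.zeros((N, 1))
y_rec = np.zeros(L)
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_rec[k] = (C @ h).item()

# (b) global convolution with the unrolled kernel, Eqs. (5)-(6)
K = np.array([(C @ np.linalg.matrix_power(A_bar, i) @ B_bar).item()
              for i in range(L)])
y_conv = np.array([np.dot(K[: k + 1][::-1], x[: k + 1]) for k in range(L)])

assert np.allclose(y_rec, y_conv)   # both formulations produce the same output
```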

3.2.3. The VSS Block and the SS2D Module

The entire LinU-Mamba model is built upon VSS Blocks, which are the core blocks that perform visual representation learning. The architecture of a VSS Block is depicted in Figure 3. The VSS Block starts with a normalization layer and then goes through a 2D-Selective-Scan (SS2D) branch while keeping a residual shortcut around it. Inside that branch, the data is linearly projected, mixed locally by a depth-wise convolution, passed through a SiLU non-linearity, processed by the SS2D module that captures long-range dependencies, normalized again, and finally projected back to the original feature dimension. The result is added to the shortcut to complete the first residual layer. A second residual layer is composed of a normalization layer followed by an optional feed-forward network (FFN). The overall block mirrors a Transformer’s two-sublayer layout but swaps attention for state-space modeling.
The 2D-Selective-Scan (SS2D) is derived from VMamba [52]. The three stages of the SS2D, a cross-scan, a selective scan, and a cross-merge, are represented in Figure 4. First, in the cross-scan step, the image is unfolded into four separate sequences of patches that follow different patterns. Each sequence is sent to its own S6 block for parallel processing. The pseudo-code for an S6 block is presented in Algorithm 1. The cross-merge step consists of reshaping and reassembling the resulting sequences to rebuild the feature map. The four scan paths point in complementary directions so that every pixel can gather information from all other pixels, giving the model a global receptive field across the 2D space.   
Algorithm 1: Forward pass of an S6 block.
(The pseudo-code of Algorithm 1 is provided as a figure in the original article.)
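Since the pseudo-code above is rendered as a figure, the following PyTorch sketch illustrates the main steps of an S6 forward pass as described for Mamba [35]: input-dependent B, C, and Δ, zero-order-hold discretization, and a scan over the sequence. It is a simplified reading aid (diagonal state matrix, a single Δ shared across channels, and an explicit Python loop in place of the fused hardware-efficient parallel scan), not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class S6(nn.Module):
    """Sequential (non-fused) sketch of an S6 selective-scan forward pass."""

    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_state = d_state
        # log-parameterized diagonal state matrix A, one row per channel
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_model, 1))
        self.x_proj = nn.Linear(d_model, 2 * d_state + 1)  # produces B, C, delta
        self.D = nn.Parameter(torch.ones(d_model))          # skip term

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        batch, length, d_model = x.shape
        A = -torch.exp(self.A_log)                           # negative => stable dynamics
        B, C, delta = torch.split(self.x_proj(x),
                                  [self.d_state, self.d_state, 1], dim=-1)
        delta = F.softplus(delta)                            # (batch, length, 1)

        # zero-order-hold discretization, per step and per channel
        dA = torch.exp(delta.unsqueeze(-1) * A)              # (batch, length, d_model, d_state)
        dB = delta.unsqueeze(-1) * B.unsqueeze(2)            # (batch, length, 1, d_state)

        h = x.new_zeros(batch, d_model, self.d_state)
        ys = []
        for t in range(length):                              # sequential scan
            h = dA[:, t] * h + dB[:, t] * x[:, t].unsqueeze(-1)
            ys.append(torch.einsum("bdn,bn->bd", h, C[:, t]))
        y = torch.stack(ys, dim=1)                           # (batch, length, d_model)
        return y + self.D * x                                # residual skip term
```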

4. Experimental Setup

4.1. Dataset

The dataset used in this study is the NDWS [14], which contains 2D remote sensing input features with fire masks representing historical wildfires across the contiguous United States from 2012 to 2020.
The Next Day Wildfire Spread (NDWS) dataset [14] was assembled with Google Earth Engine [53], which allowed its authors to blend several national-scale remote-sensing archives into one coherent resource. Its core fire signal is the daily MODIS active-fire mask MOD14A1 [54], which is fused with Shuttle Radar Topography Mission elevations [55], GRIDMET weather grids [56], GRIDMET-derived drought indices [57], VIIRS VNP13A1 vegetation indices [58], and GPWv4 population counts [59]. Across these sources, NDWS provides eleven explanatory channels, namely, elevation, wind direction, wind speed, minimum and maximum temperature, humidity, precipitation, drought index, NDVI, population density, and the Energy Release Component (ERC) fuel-moisture metric. All layers are resampled to 1 km and coregistered in the same WGS-84 projection so that pixels align in space and time. Each sample is a 64 km × 64 km tile, a size chosen to enclose most active fires while remaining computationally tractable. For every tile, the authors save two daily snapshots: the “previous fire mask” at day t and the “fire mask” at day t + 1. The dataset is purpose-built for next-day spread prediction. Tiles were generated whenever a MODIS hotspot appeared between 2012 and 2020 over the contiguous United States. Figure 5 illustrates samples containing all the channel features.
Fires more than 10 km apart are treated as separate events, yielding 18,545 distinct fire cases in total. Of these, 58% grew, 39% shrank, and the remainder stayed unchanged between the two days, giving the collection a realistic balance of behaviors. Pixels with cloud or other missing observations are flagged as “uncertain”. In the dataset, about 97% of all pixels in the target fire mask are the “no fire” class, 1.87% are uncertain labels, and around 1.06% are the “fire” class. The authors of NDWS have already split the dataset for training, validation, and testing. We kept this partitioning in our study, which consists of 14,979 training samples, 1877 validation samples, and 1689 test samples.

4.2. Data Preprocessing and Data Augmentation

The task of wildfire spread prediction is approached as a deep learning segmentation task. The variables and the “previous fire mask” at time t are used as data features for our model, and the “fire mask” at time t + 1 day as labels. For each data feature, excluding the fire masks, values are initially clipped to a specific minimum and maximum range. These ranges vary per feature. Then, we normalize each feature independently by subtracting its mean and dividing by its standard deviation. The clipping and the normalization are performed according to the data preprocessing in [14]. Adopting a similar strategy as in [14], where uncertain labels are ignored during loss and performance calculations, we clamped the uncertain labels to 0.
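A minimal sketch of this preprocessing is given below, assuming NumPy arrays; the clip bounds and statistics are placeholders, since the actual per-feature values follow the NDWS preprocessing in [14], and the -1 encoding for uncertain labels is how the dataset flags them.

```python
import numpy as np

def preprocess_feature(values: np.ndarray, vmin: float, vmax: float,
                       mean: float, std: float) -> np.ndarray:
    """Clip a feature to its range, then standardize it (per-feature stats)."""
    clipped = np.clip(values, vmin, vmax)     # clamp to the feature-specific range
    return (clipped - mean) / std             # zero-mean, unit-variance normalization

def preprocess_labels(fire_mask: np.ndarray) -> np.ndarray:
    """Clamp uncertain labels (encoded as -1 in NDWS) to the background class."""
    return np.where(fire_mask == -1, 0, fire_mask)
```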
For data augmentation, horizontal flips, vertical flips, and random rotations are performed. Each transformation is applied independently with an 80% chance, so most samples receive at least one alteration while a small fraction remain unchanged. Random rotations of 0°, 90°, 180°, or 270° are applied. The same operations are applied to both the input and target images, preserving their spatial correspondence, as sketched below.
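The following sketch summarizes this augmentation pipeline, assuming PyTorch tensors with spatial dimensions last; the 80% application probability and the rotation angles are those listed above, while the function itself is illustrative rather than the exact training code.

```python
import random
import torch

def augment(inputs: torch.Tensor, target: torch.Tensor, p: float = 0.8):
    """Apply each transform independently with probability p, identically to
    the input stack and the target fire mask (spatial dims are the last two)."""
    if random.random() < p:                               # horizontal flip
        inputs, target = inputs.flip(-1), target.flip(-1)
    if random.random() < p:                               # vertical flip
        inputs, target = inputs.flip(-2), target.flip(-2)
    if random.random() < p:                               # rotation by 0/90/180/270 degrees
        k = random.randint(0, 3)
        inputs = torch.rot90(inputs, k, dims=(-2, -1))
        target = torch.rot90(target, k, dims=(-2, -1))
    return inputs, target
```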

4.3. Loss

One of the challenges of the dataset is its severe class imbalance, where the “no fire” (negative) class outnumbers the “fire” (positive) class by orders of magnitude. The use of a naïve loss would cause the model to predict mostly zeros. Refs. [11,14] used Weighted Binary Cross Entropy (WBCE) to tackle the issue of class imbalance. In our case, we linearly combined a WBCE loss with a Dice loss to address the challenge.
The WBCE loss is defined as:
$$L_{\mathrm{WBCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ w_{1}\, y_{i} \log(p_{i}) + w_{0}\, (1 - y_{i}) \log(1 - p_{i}) \right],$$
where $p_{i} \in [0, 1]$ is the predicted probability, $y_{i} \in \{0, 1\}$ is the ground-truth label, and $w_{1}$, $w_{0}$ control the cost of misclassifying positive and negative pixels, respectively.
The Dice loss is defined as:
$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i} p_{i}\, y_{i}}{\sum_{i} p_{i} + \sum_{i} y_{i}},$$
where the numerator corresponds to twice the area of overlap between prediction and target. Both losses are linearly combined as:
$$L = w_{b}\, L_{\mathrm{WBCE}} + w_{d}\, L_{\mathrm{Dice}},$$
where the scalars $w_{b}$ and $w_{d}$ let us tune the relative emphasis on the Dice term versus the WBCE term, ensuring that minority-class pixels contribute a proportionate share of the gradient signal. For the WBCE, the class weights are set to $w_{0} = 1$ and $w_{1} = 3$ to penalize false negatives more heavily. $w_{b}$ is set to 1 and $w_{d}$ to 3, which ensures that even a tiny missed object can yield a large Dice error. This combination gives consistent learning pressure across all training stages and reduces the risk that the network converges to the trivial all-background solution.
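A minimal PyTorch sketch of this combined loss, using the weights above, is shown below; it illustrates the formulation rather than reproducing the exact training code, and the small epsilon terms are added for numerical stability.

```python
import torch
import torch.nn as nn

class WBCEDiceLoss(nn.Module):
    """Weighted BCE + Dice loss with the weights used in the paper
    (w0 = 1, w1 = 3, w_bce = 1, w_dice = 3). Illustrative sketch."""

    def __init__(self, w0=1.0, w1=3.0, w_bce=1.0, w_dice=3.0, eps=1e-6):
        super().__init__()
        self.w0, self.w1 = w0, w1
        self.w_bce, self.w_dice, self.eps = w_bce, w_dice, eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(logits)
        # weighted binary cross-entropy term
        bce = -(self.w1 * target * torch.log(p + self.eps)
                + self.w0 * (1 - target) * torch.log(1 - p + self.eps)).mean()
        # soft Dice term
        inter = (p * target).sum()
        dice = 1 - 2 * inter / (p.sum() + target.sum() + self.eps)
        return self.w_bce * bce + self.w_dice * dice
```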

4.4. Performance Metrics

In segmentation tasks, precision, recall, and the F1-score quantify complementary aspects of predictive quality. Precision measures the proportion of predicted-positive pixels that are truly positive, i.e.,
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
where TP and FP denote the numbers of true positives and false positives, respectively. Recall measures the fraction of all actual positives that the model correctly identifies,
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
where FN represents the false negatives that the model failed to identify. Because precision and recall often trade off against one another, their harmonic mean is expressed as:
$$\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
A good F1-score implies that the model simultaneously achieves good precision, with few false positives, and good recall, with few false negatives. Because the F1-score is the harmonic mean of the two metrics, if either one drops, the harmonic mean falls quickly.
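For clarity, the sketch below computes these pixel-level metrics from binary masks with NumPy; it assumes uncertain labels have already been clamped as described in Section 4.2 and is an illustration rather than the exact evaluation code.

```python
import numpy as np

def precision_recall_f1(pred: np.ndarray, target: np.ndarray):
    """Pixel-level precision, recall, and F1-score for binary fire masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()      # true positives
    fp = np.logical_and(pred, ~target).sum()     # false positives
    fn = np.logical_and(~pred, target).sum()     # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```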

4.5. Implementation Details

LinU-Mamba has a depth of [1,1,1,1] for the encoder and the same for the decoder. For some of the experiments, we initialized the model with the tiny pre-trained weights VMamba-T from VMamba [52]. To train the model, we used the AdamW optimizer with a learning rate of 0.0005 and a weight decay of 0.1. We used a dropout rate of 0.5 and trained the model for 30 epochs on one NVIDIA GeForce RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA).
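The optimization setup can be sketched as follows; the model, loss, and data below are dummy placeholders standing in for LinU-Mamba, the combined loss of Section 4.3, and the NDWS loader, so only the optimizer settings reflect the configuration above.

```python
import torch
import torch.nn as nn

# dummy stand-ins for LinU-Mamba, the WBCE + Dice loss, and an NDWS batch
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.Dropout(0.5),
                      nn.Conv2d(16, 1, 1))
criterion = nn.BCEWithLogitsLoss()
inputs = torch.randn(4, 3, 64, 64)                    # elevation, vegetation, previous fire mask
target = torch.randint(0, 2, (4, 1, 64, 64)).float()  # next-day fire mask

# optimizer settings reported in the text: AdamW, lr = 5e-4, weight decay = 0.1
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)

for epoch in range(30):                                # 30 epochs, as in the paper
    optimizer.zero_grad()
    loss = criterion(model(inputs), target)
    loss.backward()
    optimizer.step()
```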

5. Results

We achieved the best performance with LinU-Mamba trained from pre-trained weights VMamba-T and with only three of the feature channels, namely elevation, vegetation, and previous fire mask. Figure 6 shows the results of some samples of predicted fire masks with the three input features. Table 1 compares the performances of our model with other models in the literature using the same dataset.
To ensure result reliability, we evaluated LinU-Mamba over 10 independent runs, computing the mean and standard deviation of the F1-score, precision, and recall. In addition, we trained and evaluated recent spatio-temporal models on the NDWS dataset, running each of them 10 times and computing the mean and standard deviation as well. These results are also reported in Table 1. This comparison is intended to contextualize our model’s performance. As the authors of the multikernel CNN and CNN-ASPP [26,60] trained their models on a subset of the dataset, we retrained those reference models on the full dataset to make the comparison with our model relevant. Differences in sampling strategy, preprocessing, and stopping criteria might have affected the reported metrics, and exploring these systematically is outside the scope of this paper. We retained the authors’ reported hyperparameters without extensive retuning for the expanded data, which may partly explain the gap between the performance of their models here and that reported in the original papers.
For the three external baselines that we did not retrain, because their authors reported using the full dataset and the same data partitioning (WPN [61], the convolutional autoencoder [14], and ASUFM [11]), we report the F1-scores as published by their original authors, since seed-level results were not available. We assume that these authors followed standard evaluation protocols and thus treat their single-run scores as reliable. This approach provides a comprehensive performance comparison.
LinU-Mamba achieves a precision of 37.25 ± 1.4%, essentially matching the top precision of 37.28% held by ASUFM and substantially outperforming earlier CNN-based approaches (24.41 ± 5.14% for the multikernel CNN and 31.40% for WPN). Its recall of 49.33 ± 2.43% is higher than that of most methods, including WPN (45.10%), the convolutional autoencoder (43.1%), and ASUFM (43.01%), even though it remains below the CNN-ASPP recall of 57.71 ± 1.49%.
ASUFM [11] was reported to achieve an F1-score of 41.09%, while LinU-Mamba delivered 42.22 ± 0.29%. The observed standard deviation (0.29%) is less than half of the absolute improvement ($1.13\%/2 \approx 0.56\%$), showing that run-to-run variability is small relative to the gain. Furthermore, the 95% confidence interval half-width, calculated as
$$t_{0.975,\,9} \cdot \frac{0.29\%}{\sqrt{10}} \approx 0.21\%,$$
is far below the 1.13% increase, confirming the performance gain is statistically significant. By using 10 independent runs to compute the performance metrics, we mitigate stochastic effects and ensure reproducibility. Consequently, LinU-Mamba’s improvement over ASUFM is both robust and reliable. LinU-Mamba delivers the best overall balance, attaining the highest F1-score of 42.22 ± 0.29%. This combination of strong precision and above-average recall means LinU-Mamba shows superior performance on the NDWS dataset.
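The half-width above can be reproduced with SciPy, as in the short sketch below (10 runs, standard deviation of 0.29 percentage points on the F1-score).

```python
from math import sqrt
from scipy.stats import t

# 95% confidence interval half-width for 10 runs with std = 0.29 percentage points
half_width = t.ppf(0.975, df=9) * 0.29 / sqrt(10)
print(round(half_width, 2))   # ~0.21 percentage points, well below the 1.13-point gain
```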
With only a 13M-parameter architecture for wildfire spread prediction, LinU-Mamba demonstrates an efficient training profile, requiring only 0.2 GFLOPs per forward pass and 0.22 min per epoch. Table 2 reports the model size and training time for LinU-Mamba and other models on the same dataset. Compared to the multikernel CNN [26] and CNN-ASPP [60], which we retrained on the full dataset because their original papers used only a subset, LinU-Mamba reduces computational complexity in FLOPs by roughly 66% and 82%, respectively. The smaller per-epoch runtime highlights our model’s efficient design, yielding a 67% speed-up over CNN-ASPP while keeping the parameter count in the same order of magnitude. For WPN and ASUFM, we relied on the performance metrics reported in the literature, as their original publications did not include FLOPs or training-time measurements, preventing a full runtime comparison. Overall, LinU-Mamba strikes a superior balance between model complexity and training speed, while still showing the strongest performance.

5.1. Ablation Studies

5.1.1. VSS Block

Table 3 reports the results of an ablation study on the VSS Block in LinU-Mamba. Removing all VSS Blocks yields the lowest performance (precision 35.75 ± 0.84%, recall 42.69 ± 1.08%, F1-score 38.90 ± 0.30%), confirming that the state-space module is beneficial. When VSS Blocks are used only in the encoder, recall reaches 49.13 ± 2.46% and the F1-score increases to 42.20 ± 0.36%, indicating a considerable boost in capturing fire spread patterns. Placing VSS Blocks only in the decoder produces a similar recall gain (49.48 ± 1.62%) but a slightly lower F1-score (41.82 ± 0.32%), which suggests that decoder-side state-space modeling helps less than encoder-side modeling. Enabling VSS Blocks in both the encoder and decoder yields an F1-score of 42.22 ± 0.29%, similar to the encoder-only configuration. This pattern implies that most of the VSS Block’s contribution comes from its role in the encoder, with diminishing returns when it is also applied in the decoder. Overall, the ablation confirms that incorporating VSS Blocks improves segmentation performance, primarily by increasing recall, while additional blocks in the decoder offer limited gains.

5.1.2. Pre-Training

Pre-training has been shown to help deep learning models learn broad, low-level patterns and general abstractions before learning the specifics of a smaller dataset [62]. We therefore investigated whether starting training from VMamba-T pre-trained weights would boost performance.
The use of pre-trained weights also motivated exploring the use of only three features, since the VMamba pre-trained weights expect three input channels, the model having been pre-trained on RGB images from ImageNet-1K. Fitzgerald et al. [61] and Huot et al. [14] conducted feature ablation and feature analysis studies on the dataset used in this work. Fitzgerald et al. [61] concluded that the combination of elevation, vegetation, and previous fire mask yields the best results. Furthermore, Huot et al. [14] trained their model on two-feature subsets, pairing the previous fire mask with each of the other features, and concluded that vegetation and elevation result in the best performance. Based on these results, we carried out experiments with those three features. Using the three features elevation, vegetation, and previous fire mask, and starting the training from VMamba-T pre-trained weights, yields an increase in performance compared to not utilizing pre-trained weights. LinU-Mamba achieves even better performance with the selection of three input features than with the 12 input features, whether pre-trained weights are used or not. Table 4 presents those comparisons.

5.1.3. Linear Attention

We also conducted ablation studies on the linear attention block in LinU-Mamba. We compared the performance of the model using pre-trained weights and three input channels (elevation, vegetation, previous fire mask), without linear attention, with linear attention in the encoder only, with linear attention in the decoder only, and with linear attention in the encoder and the decoder. Table 5 shows the results of those experiments. LinU-Mamba achieves a peak F1-score of 42.22 ± 0.29% with linear attention in both the encoder and decoder. Although performance does not differ significantly when applying linear attention to only the encoder or only the decoder, all of these configurations substantially outperform the model when linear attention is not used at all.

6. Discussion

LinU-Mamba achieves strong performance on the NDWS dataset [14], reaching an F1-score of 42.22 ± 0.29% for the task of predicting wildfire spread after one day, compared to other approaches in the literature. In this section, we discuss the strengths and implications of our architecture design, the limitations of the study, and directions for future work.

6.1. Architecture Implications

LinU-Mamba demonstrates a clear advantage over conventional CNN-based wildfire spread predictors, surpassing ASUFM’s F1-score of 41.09% with an F1-score of 42.22 ± 0.29%, an improvement significantly larger than its run-to-run fluctuations. In terms of computational cost, LinU-Mamba’s 13M-parameter architecture requires only 0.2 GFLOPs per forward pass and 0.22 min per epoch, delivering a 67% speed-up compared to CNN-ASPP (1.12 GFLOPs, 0.67 min/epoch) and reducing FLOPs by 82% without inflating the parameter count. This efficiency suggests that LinU-Mamba can be trained more rapidly and at lower resource cost, which broadens its applicability for real-time operational forecasting.
The VSS Block ablation study further underscores the pivotal role of state-space modules. Removing all VSS Blocks drops the F1-score to 38.90 ± 0.30%, while introducing them in the encoder alone recovers nearly all of the gain (42.20 ± 0.36%), and full encoder–decoder inclusion reaches an F1-score of 42.22 ± 0.29%. These results reveal that encoder-side VSS Blocks drive most of the performance improvement by enhancing long-range dependency modeling, while decoder-side additions offer smaller gains. The pronounced drop in both precision and recall when omitting VSS Blocks confirms their essential function in capturing global fire dynamics. The negligible difference between the encoder-only and full VSS configurations indicates a well-calibrated architecture that avoids unnecessary complexity.
Taken together, these findings illustrate that LinU-Mamba’s core design, which combines efficient state-space layers with lightweight spatial skip attentions and global linear attention, reaches a compelling balance between accuracy and efficiency. The combination of state-space modules with linear attention yields segmentation quality improvements without prohibitive resource demands. The low variability across runs and the statistical significance of F1-score gains testify to the model’s robustness. Compared with conventional CNNs that often require larger backbones and longer training times, LinU-Mamba’s architecture presents a scalable solution for wildfire monitoring.

6.2. Further Analysis

As presented in Section 5.1.2, we conducted ablation studies on the use of pre-trained weights by training LinU-Mamba with and without VMamba-T weights. LinU-Mamba’s pre-training ablation indicates only a modest F1-score gain of 0.13 points when initializing from VMamba-T ImageNet weights for the three-channel model (42.22 ± 0.29% vs. 42.09 ± 0.35%), driven by a rise in precision (37.25% vs. 36.65%) alongside a slight dip in recall (49.33% vs. 49.56%). This suggests that low-level visual priors improve fire-pixel discrimination. The pre-trained variant also shows lower fluctuations, pointing to more stable convergence under limited remote-sensing data. By contrast, the twelve-feature model without pre-training underperforms even the non-pre-trained three-feature baseline (F1-score of 41.52%), which highlights the stronger influence of input selection over weight initialization. This disparity underscores a nuanced trade-off. While pre-training marginally accelerates abstraction of generic patterns, selecting the most informative channels (elevation, vegetation, prior fire mask) remains pivotal for segmentation accuracy. The slight recall dip in the pre-trained model shows that ImageNet-derived filters may bias the network toward precision by reinforcing texture cues less common in wildfire imagery. These results suggest that transfer learning delivers helpful low-level priors but that its benefit is secondary to rigorous feature curation in wildfire spread prediction.
As shown in Section 5.1.3, we also conducted ablation studies on the linear attention incorporated in our model. The ablation of global linear attention modules in LinU-Mamba reveals a clear reduction in segmentation performance, as removing these layers leads to a marked decline in F1-score (41.57 ± 0.35% vs. 42.22 ± 0.29%) and precision (35.69 ± 0.76% vs. 37.25 ± 1.49%). This highlights their critical role in modeling far-reaching spatial dependencies. Unlike the VSS Blocks that primarily propagate temporal information, linear attention layers excel at aggregating spatial features across the entire frame, which in turn produces finer fire front contours and reduces isolated false positives. Regarding efficiency, linear attention preserves linear runtime and memory scaling, which offers faster training and inference compared to the multipass Mamba selective scan, which requires several directional sweeps to approximate the global context. Furthermore, linear attention integrates smoothly into the UNet-like encoder–decoder framework of LinU-Mamba, alongside VSS Blocks and spatial skip attentions, without introducing specialized operators. This simplifies how the model is designed. The ablation confirms that linear attention strikes a superior balance between segmentation quality and computational efficiency. Therefore, incorporating global linear attention represents an effective architectural refinement for LinU-Mamba.

6.3. Study Limitations and Future Work

A key limitation of the present dataset is its temporal depth, as it supplies only a single day of previous fire information and key variable data. Modern spatio-temporal networks typically learn more stable representations when they can learn from several preceding days of fire evolution. Without this historical context, the model must infer spread dynamics from a single snapshot, which increases uncertainty and the risk of predictions driven by short-term fluctuations. The spatial resolution also presents an important limitation. The 1 km spatial resolution is coarse and fails to capture critical heterogeneities such as fuel breaks, creek lines, or abrupt slope changes, even though these heterogeneities often determine local spread rates and direction. At this scale, subkilometer ignition clusters blend into single pixels, which makes it difficult for models to learn accurate propagation patterns. These temporal and spatial constraints limit the model’s capacity to capture fine-grained fire details, and they likely contribute to over-smoothed prediction details. Future work should prioritize assembling longer temporal sequences and using better spatial resolution to harness the full predictive potential of deep learning approaches.
While global linear attention and VSS Blocks significantly improve long-range dependency modeling, they can unintentionally oversmooth fine fire-front details. Future work could investigate hybrid attention schemes or adaptive attention spans that dynamically balance local high-frequency feature preservation with global context aggregation. In addition, despite boosting feature fusion, the added spatial skip-attention modules introduce non-trivial parameter and computational overhead. Exploring lightweight alternatives such as dynamic gating, cross-scale Transformers, or grouped convolutions could maintain fusion benefits while reducing complexity. Last, the fixed stacking of state space and attention blocks limits architectural flexibility. Incorporating conditional computation, adaptive-depth networks, or neural architecture search could enable the model to tailor its capacity to varying landscape heterogeneity and fire dynamics.

7. Conclusions

We introduced LinU-Mamba, a UNet-like encoder–decoder framework with global linear attention and spatial attention in skip connections. On the NDWS prediction dataset [14], LinU-Mamba sets a strong F1-score while being computationally efficient. We believe LinU-Mamba represents the first application of a Mamba-based model to wildfire spread prediction, establishing a strong foundation for future advancements in the field. Ablations show that using weights pre-trained on RGB images helps performance, even when training on remote sensing images with very different textures, but feature selection boosts performance even more significantly. Adding global linear attention also clearly improves prediction accuracy. The dataset’s temporal depth of a single day hinders the learning capabilities of the model, and its 1 km spatial resolution does not provide enough detail for the model to accurately capture fire spread. These limitations likely contribute to the generally low prediction performance on the dataset. To fully leverage the predictive potential of deep learning, future research should prioritize assembling longer temporal sequences. Additionally, utilizing improved spatial resolution will be crucial for enhanced prediction accuracy.

Author Contributions

Conceptualization, M.A.A. and H.S.A.; methodology, H.S.A. and M.A.A.; validation, H.S.A. and M.A.A.; formal analysis, H.S.A. and M.A.A.; writing—original draft preparation, H.S.A.; writing—review and editing, M.A.A.; funding acquisition, M.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was enabled in part by support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), funding reference number RGPIN-2024-05287.

Data Availability Statement

This work uses a publicly available dataset; see reference [14] for data availability. More details about this dataset are available in Section 4.1.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NDWS: Next Day Wildfire Spread
MODIS: Moderate Resolution Imaging Spectroradiometer
VIIRS: Visible Infrared Imaging Radiometer Suite
ABI: Advanced Baseline Imager
MSI: Multispectral Instrument
CNN: Convolutional neural network
LSTM: Long short-term memory
SHAP: SHapley Additive exPlanations
FBP: Canadian Forest Fire Behavior Prediction System
LLMs: Large language models
FRP: Fire Radiative Power
VSS: Visual State Space
SSMs: State-space models
Vim: Vision Mamba
ODEs: Ordinary differential equations
SS2D: 2D-Selective-Scan
FFN: Feed-forward network
ERC: Energy Release Component
WBCE: Weighted Binary Cross Entropy

References

  1. Potapov, P.; Tyukavina, A.; Turubanova, S.; Hansen, M.C.; Giglio, L.; Hernandez-Serna, A.; Lima, A.; Harris, N.; Stolle, F. Unprecedentedly high global forest disturbance due to fire in 2023 and 2024. Proc. Natl. Acad. Sci. USA 2025, 122, e2505418122. [Google Scholar] [CrossRef]
  2. Canada’s Record-Breaking Wildfires in 2023: A Fiery Wake-Up Call. Available online: https://natural-resources.canada.ca/stories/simply-science/canada-s-record-breaking-wildfires-2023-fiery-wake-call (accessed on 26 July 2025).
  3. Annual Area Burnt by Wildfires. Available online: https://ourworldindata.org/grapher/annual-area-burnt-by-wildfires (accessed on 4 June 2025).
  4. Billion-Dollar Weather and Climate Disasters. Available online: https://www.ncei.noaa.gov/access/billions/ (accessed on 26 July 2025).
  5. MODIS Moderate Resolution Imaging Spectroradiometer. Available online: https://modis.gsfc.nasa.gov/ (accessed on 4 June 2025).
  6. VIIRS Visible Infrared Imaging Radiometer Suite. Available online: https://www.earthdata.nasa.gov/data/instruments/viirs (accessed on 4 June 2025).
  7. ABI Advanced Baseline Imager. Available online: http://earthdata.nasa.gov/data/instruments/abi (accessed on 4 June 2025).
  8. Sentinel-2 MSI. Available online: https://www.earthdata.nasa.gov/data/instruments/sentinel-2-msi (accessed on 4 June 2025).
  9. Radke, D.; Hessler, A.; Ellsworth, D. FireCast: Leveraging Deep Learning to Predict Wildfire Spread. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Macao, China, 10–16 August 2019; pp. 4575–4581. [Google Scholar]
  10. Masrur, A.; Yu, M.; Taylor, A. Capturing and Interpreting Wildfire Spread Dynamics: Attention-Based Spatiotemporal Models Using ConvLSTM Networks. Ecol. Inform. 2024, 82, 102760. [Google Scholar] [CrossRef]
  11. Li, B.S.; Rad, R. Wildfire Spread Prediction in North America Using Satellite Imagery and Vision Transformer. In Proceedings of the 2024 IEEE Conference on Artificial Intelligence (CAI), Singapore, 25–27 June 2024; pp. 1536–1541. [Google Scholar]
  12. Subramanian, S.G.; Crowley, M. Combining MCTS and A3C for Prediction of Spatially Spreading Processes in Forest Wildfire Settings. In Proceedings of the Advances in Artificial Intelligence, 31st Canadian Conference on Artificial Intelligence, Canadian AI 2018, Toronto, ON, Canada, 8–11 May 2018; pp. 285–291. [Google Scholar]
  13. Jiang, W.; Wang, F.; Su, G.; Li, X.; Wang, G.; Zheng, X.; Wang, T.; Meng, Q. Modeling Wildfire Spread with an Irregular Graph Network. Fire 2022, 5, 185. [Google Scholar] [CrossRef]
  14. Huot, F.; Hu, R.L.; Goyal, N.; Sankar, T.; Ihme, M.; Chen, Y.F. Next Day Wildfire Spread: A Machine Learning Dataset to Predict Wildfire Spreading from Remote-Sensing Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4412513. [Google Scholar] [CrossRef]
  15. Hirsch, K. Canadian Forest Fire Behavior Prediction (FBP) System: User’s Guide; Canadian Forest Service, Northern Forestry Centre: Edmonton, AB, Canada, 1996.
  16. Finney, M.A. FARSITE, Fire Area Simulator-Model Development and Evaluation; U.S. Department of Agriculture, Forest Service, Rocky Mountain Research Station: Washington, DC, USA, 1998.
  17. Rothermel, R. A Mathematical Model for Predicting Fire Spread in Wildland Fuels. In The Bark Beetles, Fuels, and Fire Bibliography; US Forest Service: Washington, DC, USA, 1972; Volume Res. Pap. INT-115, pp. 1–40. [Google Scholar]
  18. Prometheus. Available online: https://firegrowthmodel.ca/#/prometheus_overview (accessed on 4 June 2025).
  19. Alexandridis, A.; Vakalis, D.; Siettos, C.I.; Bafas, G.V. A cellular automata model for forest fire spread prediction: The case of the wildfire that swept through Spetses Island in 1990. Appl. Math. Comput. 2008, 204, 191–201. [Google Scholar] [CrossRef]
  20. Zheng, Z.; Huang, W.; Li, S.; Zeng, Y. Forest Fire Spread Simulating Model Using Cellular Automaton with Extreme Learning Machine. Ecol. Model. 2017, 348, 33–43. [Google Scholar] [CrossRef]
  21. Xu, Y.; Li, D.; Ma, H.; Lin, R.; Zhang, F. Modeling Forest Fire Spread Using Machine Learning-Based Cellular Automata in a GIS Environment. Forests 2022, 13, 1974. [Google Scholar] [CrossRef]
  22. Khanmohammadi, S.; Arashpour, M.; Golafshani, E.M.; Cruz, M.G.; Rajabifard, A.; Bai, Y. Prediction of Wildfire Rate of Spread in Grasslands Using Machine Learning Methods. Environ. Model. Softw. 2022, 156, 105507. [Google Scholar] [CrossRef]
  23. Wood, D.A. Prediction and Data Mining of Burned Areas of Forest Fires: Optimized Data Matching and Mining Algorithm Provides Valuable Insight. Artif. Intell. Agric. 2021, 5, 24–42. [Google Scholar] [CrossRef]
  24. Singh, A.; Yadav, R.; Sudhamshu, G.; Basnet, A.; Ali, R. Wildfire Spread Prediction Using Machine Learning Algorithms. In Proceedings of the 2023 14th International Conference on Computing Communication and Networking Technologies (ICCCNT), Delhi, India, 6–8 July 2023; pp. 1–5. [Google Scholar]
  25. Rubí, J.N.S.; de Carvalho, P.H.P.; Gondim, P.R.L. Application of Machine Learning Models in the Behavioral Study of Forest Fires in the Brazilian Federal District Region. Eng. Appl. Artif. Intell. 2023, 118, 105649. [Google Scholar] [CrossRef]
  26. Marjani, M.; Mesgari, M.S. The Large-Scale Wildfire Spread Prediction Using a Multi-Kernel Convolutional Neural Network. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, 14W1, 483–488. [Google Scholar] [CrossRef]
  27. Khennou, F.; Ghaoui, J.; Akhloufi, M.A. Forest Fire Spread Prediction Using Deep Learning. In Proceedings of the Geospatial Informatics XI, Bellingham, WA, USA, 10–11 December 2021; pp. 106–117. [Google Scholar]
  28. Chen, Y.; Hantson, S.; Andela, N.; Coffield, S.R.; Graff, C.A.; Morton, D.C.; Ott, L.E.; Foufoula-Georgiou, E.; Smyth, P.; Goulden, M.L.; et al. California Wildfire Spread Derived Using VIIRS Satellite Observations and an Object-Based Tracking System. Sci. Data 2022, 9, 249. [Google Scholar] [CrossRef]
  29. Marjani, M.; Mahdianpari, M.; Mohammadimanesh, F. CNN-BiLSTM: A Novel Deep Learning Model for Near-Real-Time Daily Wildfire Spread Prediction. Remote Sens. 2024, 16, 1467. [Google Scholar] [CrossRef]
  30. Shadrin, D.; Illarionova, S.; Gubanov, F.; Evteeva, K.; Mironenko, M.; Levchunets, I.; Belousov, R.; Burnaev, E. Wildfire Spreading Prediction Using Multimodal Data and Deep Neural Network Approach. Sci. Rep. 2024, 14, 2606. [Google Scholar] [CrossRef]
  31. Ramesh, M.; Sun, Z.; Li, Y.; Zhang, L.; Annam, S.K.; Fang, H.; Tong, D. Assessing WildfireGPT: A comparative analysis of AI models for quantitative wildfire spread prediction. Nat. Hazards 2025, 121, 13117–13130. [Google Scholar] [CrossRef]
  32. Gu, A.; Goel, K.; Re, C. Efficiently Modeling Long Sequences with Structured State Spaces. In Proceedings of the International Conference on Learning Representations, Virtual, 25 April 2022. [Google Scholar]
  33. Goel, K.; Gu, A.; Donahue, C.; Ré, C. It’s Raw! Audio Generation with State-Space Models. arXiv 2022, arXiv:2202.09729. [Google Scholar] [CrossRef]
  34. Ma, X.; Zhou, C.; Kong, X.; He, J.; Gui, L.; Neubig, G.; May, J.; Zettlemoyer, L. Mega: Moving Average Equipped Gated Attention. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  35. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar] [CrossRef]
  36. Hwang, S.; Lahoti, A.S.; Puduppully, R.; Dao, T.; Gu, A. Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers. In Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; pp. 1–33. [Google Scholar]
  37. Nguyen, E.; Goel, K.; Gu, A.; Downs, G.W.; Shah, P.; Dao, T.; Baccus, S.A.; Ré, C. S4ND: Modeling Images and Videos as Multidimensional Signals with State Spaces. In Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, 28 November–9 December 2022; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022. [Google Scholar]
  38. Xiao, C.; Li, M.; Zhang, Z.; Meng, D.; Zhang, L. Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion. In Proceedings of the Thirteenth International Conference on Learning Representations, Singapore, 24–28 April 2025. [Google Scholar]
  39. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning, ICML’24, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  40. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. arXiv 2024, arXiv:2403.09338. [Google Scholar] [CrossRef]
  41. Ruan, J.; Li, J.; Xiang, S. VM-UNet: Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
  42. Ma, J.; Li, F.; Wang, B. U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar] [CrossRef]
  43. Wu, R.; Liu, Y.; Liang, P.; Chang, Q. H-vmunet: High-order Vision Mamba UNet for medical image segmentation. Neurocomputing 2025, 624, 129447. [Google Scholar] [CrossRef]
  44. Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-UMamba: Mamba-Based UNet with ImageNet-Based Pretraining. In Medical Image Computing and Computer Assisted Intervention—MICCAI 2024; Linguraru, M.G., Dou, Q., Feragen, A., Giannarou, S., Glocker, B., Lekadir, K., Schnabel, J.A., Eds.; Springer: Cham, Switzerland, 2024; pp. 615–625. [Google Scholar] [CrossRef]
  45. Su, C.; Luo, X.; Li, S.; Chen, L.; Wang, J. VMKLA-UNet: Vision Mamba with KAN linear attention U-Net. Sci. Rep. 2025, 15, 13258. [Google Scholar] [CrossRef] [PubMed]
  46. Zhong, X.; Lu, G.; Li, H. Vision Mamba and xLSTM-UNet for medical image segmentation. Sci. Rep. 2025, 15, 8163. [Google Scholar] [CrossRef]
  47. Xie, F.; Nie, J.; Tang, Y.; Zhang, W.; Zhao, H. Mamba-Adaptor: State Space Model Adaptor for Visual Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 20124–20134. [Google Scholar]
  48. Lu, D.; Xu, L.; Zhou, J.; Gao, K.; Gong, Z.; Zhang, D. 3D-UMamba: 3D U-Net with state space model for semantic segmentation of multi-source LiDAR point clouds. Int. J. Appl. Earth Obs. Geoinf. 2025, 136, 104401. [Google Scholar] [CrossRef]
  49. Zhao, K.; Zhang, Q.; Wan, C.; Pan, Q.; Qin, Y. Visual Mamba UNet fusion multi-scale attention and detail infusion for unsound corn kernels segmentation. Sci. Rep. 2025, 15, 10933. [Google Scholar] [CrossRef]
  50. Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify Mamba in Vision: A Linear Attention Perspective. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  51. Li, Y.; Wang, Y.; Shao, X.; Zheng, A. An efficient fire detection algorithm based on Mamba space state linear attention. Sci. Rep. 2025, 15, 11289. [Google Scholar] [CrossRef] [PubMed]
  52. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  53. Google Earth Engine. Available online: https://earthengine.google.com/ (accessed on 26 July 2025).
  54. MODIS/Terra Thermal Anomalies/Fire Daily L3 Global 1km SIN Grid V006. Available online: https://www.earthdata.nasa.gov/data/catalog/lpcloud-mod14a1-006 (accessed on 26 July 2025).
  55. Farr, T.G.; Rosen, P.A.; Caro, E.; Crippen, R.; Duren, R.; Hensley, S.; Kobrick, M.; Paller, M.; Rodriguez, E.; Roth, L.; et al. The Shuttle Radar Topography Mission. Rev. Geophys. 2007, 45, RG2004. [Google Scholar] [CrossRef]
  56. Abatzoglou, J.T. Development of gridded surface meteorological data for ecological applications and modelling. Int. J. Climatol. 2013, 33, 121–131. [Google Scholar] [CrossRef]
  57. Abatzoglou, J.T.; Rupp, D.E.; Mote, P.W. Seasonal Climate Variability and Change in the Pacific Northwest of the United States. J. Clim. 2014, 27, 2125–2142. [Google Scholar] [CrossRef]
  58. Didan, K.; Barreto, A. VIIRS/NPP Vegetation Indices 16-Day L3 Global 500m SIN Grid V001; NASA Land Processes Distributed Active Archive Center: Sioux Falls, SD, USA, 2018. [CrossRef]
  59. Gridded Population of the World. Available online: https://www.earthdata.nasa.gov/data/projects/gpw (accessed on 26 July 2025).
  60. Marjani, M.; Mahdianpari, M.; Ali Ahmadi, S.; Hemmati, E.; Mohammadimanesh, F.; Saadi Mesgari, M. Application of Explainable Artificial Intelligence in Predicting Wildfire Spread: An ASPP-Enabled CNN Approach. IEEE Geosci. Remote Sens. Lett. 2024, 21, 2504005. [Google Scholar] [CrossRef]
  61. Fitzgerald, J.; Seefried, E.; Yost, J.E.; Pallickara, S.; Blanchard, N. Paying Attention to Wildfire: Using U-Net with Attention Blocks on Multimodal Data for Next Day Prediction. In Proceedings of the 25th International Conference on Multimodal Interaction, New York, NY, USA, 9 October 2023; pp. 470–480. [Google Scholar]
  62. Lahrichi, S.; Johnson, J.; Malof, J. Predicting Next-Day Wildfire Spread with Time Series and Attention. arXiv 2025, arXiv:2502.12003. [Google Scholar] [CrossRef]
Figure 1. Overview of LinU-Mamba architecture with all the core blocks of the model, including VSS Blocks, linear attention, and spatial attention in skip connections.
Figure 2. Illustration of the linear attention architecture. Input features are linearly projected into query, key, and value vectors; the keys are softmax-normalized and multiplied with the values to form compact context vectors. These context vectors are then matrix-multiplied with the queries and passed through a final linear projection to produce the output.
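To make the data flow described in the caption concrete, the following is a minimal PyTorch sketch of such a linear attention module. It follows the caption's description only; the layer names, single-head formulation, and token-wise softmax over the keys are illustrative assumptions, not the exact module used in LinU-Mamba.

```python
import torch
import torch.nn as nn


class LinearAttentionSketch(nn.Module):
    """Illustrative linear attention: a compact context is built from keys and values, then queried."""

    def __init__(self, dim: int):
        super().__init__()
        # Hypothetical projection layers; the paper's exact layout may differ.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels), e.g., flattened H x W spatial positions.
        q = self.to_q(x)                    # (B, N, C)
        k = self.to_k(x).softmax(dim=1)     # softmax-normalize the keys over the tokens
        v = self.to_v(x)                    # (B, N, C)
        context = k.transpose(1, 2) @ v     # (B, C, C): compact context, cost O(N * C^2)
        out = q @ context                   # (B, N, C): queries read from the context
        return self.proj(out)               # final linear projection


# Usage sketch: a batch of two 64 x 64 feature maps with 96 channels, flattened to tokens.
x = torch.randn(2, 64 * 64, 96)
y = LinearAttentionSketch(96)(x)
print(y.shape)  # torch.Size([2, 4096, 96])
```

Because the context matrix is C × C rather than N × N, the cost grows linearly with the number of tokens, which is what makes this formulation attractive for dense remote sensing grids.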
Figure 3. Structure of the VSS block in LinU-Mamba showing the core component SS2D.
Figure 4. Illustration of 2D-Selective-Scan (SS2D). Input patches are processed along four cross-scan paths, with each path independently fed into a separate S6 block. The resulting sequences are then cross-merged to form the final 2D feature map.
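As a rough illustration of the cross-scan and cross-merge steps, the sketch below unfolds a feature map along four scan paths, runs each path through a placeholder per-path module (a 1 × 1 convolution standing in for the S6 selective-scan block, which is not reproduced here), and folds the results back into a single 2D feature map. The shapes and the summation-based merge are assumptions for illustration.

```python
import torch
import torch.nn as nn


def cross_scan(x: torch.Tensor) -> list:
    """Unfold a (B, C, H, W) map into four 1D scan paths (row-major, column-major, and their reversals)."""
    row = x.flatten(2)                  # (B, C, H*W), row-major order
    col = x.transpose(2, 3).flatten(2)  # (B, C, H*W), column-major order
    return [row, row.flip(-1), col, col.flip(-1)]


def cross_merge(seqs: list, H: int, W: int) -> torch.Tensor:
    """Fold the four processed sequences back onto the 2D grid and sum them."""
    row, row_rev, col, col_rev = seqs
    B, C = row.shape[0], row.shape[1]
    out = row.view(B, C, H, W)
    out = out + row_rev.flip(-1).view(B, C, H, W)
    out = out + col.view(B, C, W, H).transpose(2, 3)
    out = out + col_rev.flip(-1).view(B, C, W, H).transpose(2, 3)
    return out


# Usage sketch: placeholder 1x1 convolutions stand in for the four S6 blocks.
x = torch.randn(2, 96, 16, 16)
s6_blocks = nn.ModuleList([nn.Conv1d(96, 96, kernel_size=1) for _ in range(4)])
paths = cross_scan(x)
processed = [s6(p) for s6, p in zip(s6_blocks, paths)]
y = cross_merge(processed, 16, 16)
print(y.shape)  # torch.Size([2, 96, 16, 16])
```

Scanning in four directions lets each spatial position aggregate information from every other position despite the one-directional nature of the underlying sequence model.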
Figure 5. Examples from NDWS [14], showcasing 64 km × 64 km patches at 1 km resolution. Each row represents a specific location and time (t), displaying the input features alongside the previous fire mask (fire locations at time t) and the fire mask for the following day (t + 1 day). Fire is indicated by red, no fire by gray, and uncertain labels by black.
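For readers assembling NDWS samples, the minimal sketch below mirrors the layout the caption describes: input feature channels on a 64 × 64 grid plus the previous fire mask at time t, with the fire mask at t + 1 day as the target. The channel count, the label encoding (1 = fire, 0 = no fire, −1 = uncertain), and the exclusion of uncertain pixels from the loss are stated as assumptions for illustration rather than as the exact dataset schema.

```python
import torch
import torch.nn.functional as F

# One synthetic NDWS-style sample: 11 environmental channels plus the previous
# fire mask, all on a 64 km x 64 km grid at 1 km resolution (assumed layout).
features = torch.randn(11, 64, 64)                          # topography, weather, vegetation, ...
prev_fire_mask = torch.randint(0, 2, (1, 64, 64)).float()   # fire locations at time t
inputs = torch.cat([features, prev_fire_mask], dim=0)       # (12, 64, 64) model input

# Next-day target with assumed encoding: 1 = fire, 0 = no fire, -1 = uncertain.
target = torch.randint(-1, 2, (1, 64, 64)).float()

# Uncertain pixels are masked out of the loss, a common choice for this dataset.
logits = torch.randn(1, 64, 64)                             # stand-in model output
valid = target >= 0
loss = F.binary_cross_entropy_with_logits(logits[valid], target[valid])
print(inputs.shape, round(loss.item(), 3))
```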
Figure 6. Examples comparing ground-truth fire masks with predicted fire masks, alongside their corresponding three input features (elevation, vegetation, and previous fire mask). Fire is indicated by red, no fire by gray.
Table 1. Performance comparison of LinU-Mamba with other methods on the NDWS dataset.

Methods                  Precision (%), Mean ± SD    Recall (%), Mean ± SD    F1-Score (%), Mean ± SD
Multikernel CNN [26]     24.41 ± 5.14                17.58 ± 2.74             20.06 ± 1.88
CNN-ASPP [60]            26.20 ± 1.09                57.71 ± 1.49             36.01 ± 0.78
WPN [61]                 31.40                       45.10                    37.02
Conv autoencoder [14]    33.6                        43.1                     37.76
ASUFM [11]               37.28                       43.01                    41.09
LinU-Mamba               37.25 ± 1.49                49.33 ± 2.43             42.22 ± 0.29

Bold indicates the highest score in each column.
Table 2. Training performance comparison of LinU-Mamba with other models.

Methods                  Parameters    FLOPs     Training Time (1 Epoch)    F1-Score (%), Mean ± SD
Multikernel CNN [26]     12.3 M        0.59 G    0.13 min                   20.06 ± 1.88
CNN-ASPP [60]            27.5 M        1.12 G    0.67 min                   36.01 ± 0.78
WPN [61]                 8.7 M         -         -                          37.02
ASUFM [11]               35 M          -         -                          41.09
LinU-Mamba               13 M          0.2 G     0.22 min                   42.22 ± 0.29

Bold indicates the highest score in the F1-score column.
Table 3. VSS Block ablation performance comparison of LinU-Mamba.

Methods        Precision (%), Mean ± SD    Recall (%), Mean ± SD    F1-Score (%), Mean ± SD
No VSS Block   35.75 ± 0.84                42.69 ± 1.08             38.90 ± 0.30
Encoder        37.14 ± 1.69                49.13 ± 2.46             42.20 ± 0.36
Decoder        36.27 ± 1.14                49.48 ± 1.62             41.82 ± 0.32
Both           37.25 ± 1.49                49.33 ± 2.43             42.22 ± 0.29

Bold indicates the highest score in each column.
Table 4. Performance comparison of LinU-Mamba using 3 features (elevation, vegetation, previous fire mask) with pre-trained weights and using 12 features.

Pre-Training   Features   Precision (%), Mean ± SD    Recall (%), Mean ± SD    F1-Score (%), Mean ± SD
No             3          36.65 ± 1.25                49.56 ± 1.75             42.09 ± 0.35
Yes            3          37.25 ± 1.49                49.33 ± 2.43             42.22 ± 0.29
No             12         35.31 ± 0.86                50.48 ± 1.65             41.52 ± 0.27

Bold indicates the highest score in each column.
Table 5. Linear attention ablation performance comparison of LinU-Mamba.

Methods        Precision (%), Mean ± SD    Recall (%), Mean ± SD    F1-Score (%), Mean ± SD
No attention   35.69 ± 0.76                49.84 ± 1.40             41.57 ± 0.35
Encoder        36.38 ± 1.14                50.24 ± 1.43             42.17 ± 0.38
Decoder        36.51 ± 0.99                50.01 ± 1.20             42.18 ± 0.33
Both           37.25 ± 1.49                49.33 ± 2.43             42.22 ± 0.29

Bold indicates the highest score in each column.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
