Wildfire Prediction in British Columbia Using Machine Learning and Deep Learning Models: A Data-Driven Framework

Nasourinia, Maryam; Passi, Kalpdrum

doi:10.3390/bdcc9110290


by Maryam Nasourinia and Kalpdrum Passi *

School of Engineering and Computer Science, Laurentian University, Sudbury, ON P3E 2C6, Canada

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2025, 9(11), 290; https://doi.org/10.3390/bdcc9110290

Submission received: 2 September 2025 / Revised: 8 November 2025 / Accepted: 10 November 2025 / Published: 14 November 2025

(This article belongs to the Topic AI for Natural Disasters Detection, Prediction and Modeling)


Abstract

Wildfires pose a growing threat to ecosystems, infrastructure, and public safety, particularly in the province of British Columbia (BC), Canada. In recent years, the frequency, severity, and scale of wildfires in BC have increased significantly, largely due to climate change, human activity, and changing land use patterns. This study presents a comprehensive, data-driven approach to wildfire prediction, leveraging advanced machine learning (ML) and deep learning (DL) techniques. A high-resolution dataset was constructed by integrating five years of wildfire incident records from the Canadian Wildland Fire Information System (CWFIS) with ERA5 reanalysis climate data. The final dataset comprises more than 3.6 million spatiotemporal records and 148 environmental, meteorological, and geospatial features. Six feature selection techniques were evaluated, and five predictive models—Random Forest, XGBoost, LightGBM, CatBoost, and an RNN + LSTM—were trained and compared. The CatBoost model achieved the highest predictive performance with an accuracy of 93.4%, F1-score of 92.1%, and ROC-AUC of 0.94, while Random Forest achieved an accuracy of 92.6%. The study identifies key environmental variables, including surface temperature, humidity, wind speed, and soil moisture, as the most influential predictors of wildfire occurrence. These findings highlight the potential of data-driven AI frameworks to support early warning systems and enhance operational wildfire management in British Columbia.

Keywords: wildfire prediction; British Columbia; machine learning; deep learning; feature selection; Canadian wildland fire information system; climate reanalysis data; ERA5

1. Introduction

Wildfires have increasingly become one of the most critical global environmental challenges, threatening human communities, natural ecosystems, and biodiversity. Intensified by rising temperatures and drier atmospheric conditions associated with climate change, wildfire events are now occurring with greater frequency, severity, and spatial extent. According to Global Forest Watch, an average of 4.2 million hectares of forest are consumed annually by wildfires, incurring economic losses estimated at USD 20 billion globally. In addition to immediate destruction, wildfires significantly degrade air quality, disrupt ecological balance, and contribute to greenhouse gas emissions and global warming.

Traditional wildfire prediction systems, such as the Canadian Fire Weather Index (FWI) and statistical models, often rely on expert knowledge and linear assumptions. These methods struggle to capture the complex, nonlinear, and high-dimensional interactions between meteorological, topographical, and biological variables that drive wildfire behavior. The emergence of data-driven approaches—particularly machine learning (ML) and deep learning (DL) techniques—offers powerful tools capable of learning intricate patterns from multi-source environmental data. ML/DL algorithms are now increasingly applied to fire detection, forecasting, weather monitoring, and risk mapping.

Despite their potential, several limitations hinder the operational use of ML in wildfire prediction. These include the scarcity of high-resolution and reliable datasets, challenges in model interpretability, and difficulties in ensuring generalization across diverse geographic regions. In British Columbia (BC), where vast forests and diverse climates make wildfire management especially complex, these limitations are compounded by inconsistent data collection and limited integration across sources.

The present study is motivated by the urgent need for more accurate and interpretable wildfire prediction systems in BC, particularly in light of recent catastrophic fire seasons. In 2023, Canada experienced its most destructive wildfire year on record, with over 17 million hectares burned. In response, national initiatives have been launched to integrate AI and ML into fire risk assessment and resource allocation. This research contributes to those efforts by developing a comprehensive ML-based framework tailored to BC’s specific environmental conditions.

The primary objective is to build and evaluate a high-performance wildfire prediction system that uses ML techniques and addresses critical challenges such as data imbalance, model transparency, and regional scalability. The framework involves integrating meteorological, geospatial, and fire incident data, applying advanced feature selection methods, and benchmarking the performance of various ML/DL models—including Random Forest, XGBoost, LightGBM, CatBoost, and RNN + LSTM.

Wildfires have become an increasingly pressing global concern, driven by rising temperatures, prolonged droughts, and heightened climate variability [1]. In Canada—particularly in British Columbia (BC)—wildfires have grown in frequency, scale, and intensity, causing severe environmental, economic, and societal impacts [2]. Accurate and timely prediction is therefore essential for informing early warning systems, optimizing emergency response, and supporting resource management strategies.

Despite substantial advances in wildfire modeling, existing studies often remain limited to small-scale regions or rely on coarse-resolution datasets, which restrict their operational applicability. This study distinguishes itself by developing a high-resolution, province-wide wildfire dataset for British Columbia, integrating over 3.6 million spatiotemporal records from multiple authoritative sources, including CWFIS, ERA5, FIRMS, and the British Columbia Wildfire Management Branch. In addition, the work presents a systematic comparison of six feature selection techniques and five machine learning and deep learning models, enabling a rigorous assessment of model performance and interpretability. By framing the results within the context of Canadian wildfire management, the research provides a scalable and data-driven decision-support framework designed to enhance both predictive capability and operational readiness for wildfire risk monitoring.

This study makes the following key contributions: (1) Development of a high-resolution wildfire dataset for British Columbia, constructed by integrating multi-source data, including the Canadian Wildland Fire Information System (CWFIS), ERA5 reanalysis data, and geospatial attributes. (2) Systematic comparison of six feature selection techniques (ReliefF, Relief, Mutual Information, Recursive Feature Elimination, Correlation-based, and Model-based Importance) across multiple ML and DL algorithms, providing insight into their relative strengths for environmental prediction. (3) Comprehensive evaluation of five predictive models (Random Forest, XGBoost, LightGBM, CatBoost, and RNN + LSTM) using multiple train–test splits and 10-fold cross-validation to ensure generalizability. (4) Implementation of Random Undersampling (RUS) to address extreme class imbalance in wildfire data, improving model reliability and performance stability. (5) Identification of key environmental drivers of wildfire occurrences, such as temperature, humidity, wind speed, and soil moisture, with strong alignment to established fire science findings. (6) Proposition of a scalable, interpretable, and region-specific ML/DL framework for wildfire prediction, designed to support data-driven decision-making and early warning systems in British Columbia.

2. Related Literature

Wildfires are a natural component of Canadian ecosystems and play a role in forest regeneration [1,2]. However, their frequency and intensity have increased in recent decades, driven by climate-induced shifts such as elevated temperatures, reduced snowpack, and persistent drought conditions [1,3,4]. National assessments indicate that Canada now experiences approximately 7000 wildfires annually, burning nearly 2.5 million hectares of land [5]. Human activity also plays a significant role, as nearly 55% of wildfires in Canada are caused by anthropogenic sources such as discarded cigarettes, unattended campfires, and industrial operations [6]. The 2016 Fort McMurray wildfire remains the costliest natural disaster in Canadian history, with over USD 9 billion in damages [7].

British Columbia accounts for nearly 40% of the total area burned annually in Canada [5,8]. The province’s diverse topography and climate, ranging from humid coastal regions to dry interior plateaus, create highly variable and difficult-to-manage fire behavior [5,8]. Climatic trends have lengthened the fire season and intensified fire activity, with the 2021 season alone resulting in more than 8600 square kilometers burned [8]. Factors such as dry summers, strong winds, mountainous terrain, and climate-induced snowmelt contribute to BC’s elevated wildfire risk [2,4,9]. In response, the BC Wildfire Service has implemented mitigation strategies that include real-time satellite surveillance, aerial patrols, FireSmart community initiatives, and partnerships with Indigenous communities [8,10,11].

Traditional wildfire prediction methods primarily rely on statistical and physical models. Statistical techniques such as logistic regression and Poisson models use historical data and meteorological variables—like temperature and humidity—to estimate fire probabilities [12]. These methods are simple and interpretable but generally fail to capture nonlinear relationships or complex interactions. Physical models, such as FARSITE, simulate fire spread using inputs such as terrain, fuel characteristics, and weather conditions [13]. Although effective for short-term fire behavior simulation, these models require detailed input data and are computationally intensive, limiting their use in real-time applications [14].

Remote sensing and GIS-based methods have expanded the capabilities of fire risk monitoring and assessment, and remote sensing-based fire modeling approaches have been widely applied for detecting fire activity, mapping burn severity, and analyzing vegetation stress [15]. Satellite systems like MODIS, Sentinel-2, and Landsat provide multispectral imagery for fire detection, vegetation stress analysis, and burn severity evaluation [16]. GIS platforms allow researchers to integrate spatial data layers (such as elevation, land cover, and human infrastructure) to assess fire susceptibility across regions [17]. However, these tools can be hampered by cloud cover, delayed data availability, and subjectivity in risk indicator selection.

Machine learning (ML) has become increasingly popular for wildfire prediction due to its ability to learn complex, high-dimensional relationships from diverse data sources. Supervised learning models—including Random Forest, Support Vector Machines, XGBoost, and LightGBM—have demonstrated high predictive accuracy in identifying fire-prone areas [11,17,18]. Deep learning approaches such as Convolutional Neural Networks (CNNs) have been used for image-based fire detection [19], while Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units are applied to capture temporal patterns in weather data [20]. Hybrid models that combine spatial and temporal inputs or multiple classifiers further enhance robustness and generalizability [21].

Despite their strengths, ML models face several challenges. First, the lack of real-time, high-resolution data, especially in remote regions like BC, limits model effectiveness [16]. Second, generalization across geographic regions remains an issue, as many models are tuned to specific environments and fail under new conditions [21]. Third, datasets used for wildfire prediction often suffer from severe class imbalance, where non-fire events vastly outnumber actual fire occurrences [22]. Lastly, many ML models are considered “black boxes,” offering little insight into their decision-making processes. This lack of transparency hinders their acceptance among policymakers and operational decision-makers [20].

Recent studies have attempted to overcome these challenges. Elsarrar et al. [23] used a rule-extraction neural network model for wildfire prediction in West Virginia, achieving high accuracy but with limited regional applicability. Hong et al. [21] implemented a hybrid method combining genetic algorithms with Random Forest and SVM models for fire susceptibility mapping in China, obtaining strong AUC scores but neglecting temporal dynamics. Omar [20] trained an LSTM-based model on a small Algerian dataset with high classification accuracy, though generalizability was not addressed. Bayat and Yildiz [24] compared ML models in Portugal and found SVM to perform best, while Sharma [25] identified Random Forest as the top model for Indian wildfire data. Li et al. [26] used ANN and SVM for forest fire prediction in China, finding ANN superior, though temporal components were not modeled.

Beyond conventional fire-science studies, interdisciplinary AI research has increasingly linked machine learning to broader environmental-resilience and sustainability frameworks. Olawade et al. [27] emphasized that AI-driven environmental monitoring enables proactive ecosystem management and disaster risk reduction. Camps-Valls et al. [28] demonstrated how deep learning models can detect and predict extreme climate events—including wildfires—with greater spatial precision. Forouheshfar et al. [29] discussed the integration of AI and ML into adaptive-governance structures that enhance systemic resilience to climate change, while Wu et al. [30] illustrated ML’s capability to optimize sustainability in complex systems such as green supply chains. Collectively, these interdisciplinary insights establish that wildfire prediction frameworks grounded in machine learning not only advance technical performance but also contribute to data-driven environmental governance and climate adaptation.

By synthesizing insights from climate science, AI research, and sustainable policy design, the present study situates wildfire prediction as part of a wider global effort toward resilient, adaptive, and intelligent environmental systems, offering both methodological rigor and operational value for Canadian wildfire management.

Building on these prior works, the present study introduces a localized, high-resolution dataset for British Columbia, integrates both spatial and temporal predictors, and employs a hybrid feature-selection strategy alongside Random Undersampling (RUS) to address class imbalance. The study also leverages deep learning architectures—specifically RNNs with LSTM units—to capture temporal dependencies and applies a comprehensive multi-metric evaluation framework to ensure robustness and interpretability. These contributions collectively advance the development of transparent, data-driven, and operationally scalable wildfire forecasting systems.

3. Materials and Methods

3.1. Dataset and Data Preprocessing

This study utilizes a comprehensive dataset developed specifically for wildfire prediction in British Columbia (BC), Canada. Due to the province’s ecological diversity and vulnerability to climate change, wildfires have become increasingly frequent and destructive. Constructing a high-quality dataset is thus essential for building reliable predictive models that reflect the spatial, temporal, and environmental conditions affecting fire behavior in this region [1,2,6].

The dataset combines multiple sources. Historical wildfire records were collected from the Canadian Wildland Fire Information System (CWFIS) [31] and the British Columbia Wildfire Management Branch [10]. These datasets include precise geolocation, ignition date, fire size, and cause. Meteorological data were sourced from Environment Canada [32] and Copernicus ERA5 reanalysis datasets, which offer daily values for temperature, humidity, wind speed, surface pressure, and precipitation. Global fire detection records from NASA’s Fire Information for Resource Management System (FIRMS) [7] and land cover layers from Global Forest Watch [5] were also incorporated to improve spatial completeness and validation.

The data integration workflow is summarized in Figure 1. The process began with preprocessing of meteorological variables, which were originally stored in NetCDF format. These files were converted into CSV to facilitate data processing. Fire incident records were filtered by location and date, converted from shapefiles to structured tables, and aligned with climate data using spatial and temporal joins. A provincial boundary shapefile from Statistics Canada was applied to ensure that all observations fell within BC. Water body masks were used to remove points over lakes, rivers, and oceans, ensuring that fire labels would not be falsely assigned in non-burnable areas.

To assign binary classification labels, each spatial grid cell was labeled as ‘fire’ (1) or ‘non-fire’ (0) based on its proximity to a documented fire event on the same date. The Haversine distance formula was used to determine the proximity between ERA5 grid points and fire ignition records, ensuring accurate spatial linkage. The final dataset contains 3,631,455 rows and 148 features, covering a wide range of meteorological, geographic, and temporal variables. A summary of the overall dataset composition and its characteristics is presented in Table 1, compiled from multiple data sources including CWFIS [31], FIRMS [7], Environment Canada [32], BC Wildfire Service [10], and Global Forest Watch [5].
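The proximity-based labeling step described above can be illustrated with a minimal sketch. It assumes grid points and fire records are given as `(date, lat, lon)` tuples; the function names and the `radius_km` threshold are hypothetical, since the distance cutoff used in the study is not stated here.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def label_grid_cells(grid_points, fire_records, radius_km=15.0):
    """Label each (date, lat, lon) grid point 1 if a fire was recorded on the
    same date within radius_km, else 0. radius_km is an assumed placeholder."""
    fires_by_date = {}
    for day, lat, lon in fire_records:
        fires_by_date.setdefault(day, []).append((lat, lon))

    labels = []
    for day, lat, lon in grid_points:
        same_day_fires = fires_by_date.get(day, [])
        near = any(
            haversine_km(lat, lon, f_lat, f_lon) <= radius_km
            for f_lat, f_lon in same_day_fires
        )
        labels.append(1 if near else 0)
    return labels
```

Grouping fire records by date first means each grid point is only compared against fires from the same day, which keeps the pairwise distance checks manageable on a dataset of this size.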

In addition to standard weather indicators, the dataset includes derived features such as vegetation indices, soil moisture, and evapotranspiration rates, which are important predictors of fire risk. Geospatial attributes such as latitude, longitude, and elevation add spatial context, while calendar variables such as day-of-year and seasonality help model cyclical fire behavior. This rich feature space supports the training of both conventional and deep learning models.

To enhance predictive performance and reduce overfitting, six feature selection techniques were applied independently: Relief, ReliefF, correlation-based filtering, recursive feature elimination (RFE), mutual information (MI), and model-based selection using Random Forest and XGBoost [18]. Each method was applied to the complete dataset. The selected feature subsets were then used to train separate models, and the results were compared using accuracy, recall, F1-score, and ROC-AUC.

The configuration settings used for each feature selection method are presented in Table 2. These settings were empirically optimized to balance predictive accuracy and computational efficiency. For the Relief and ReliefF algorithms, the neighbor values (k = 10–20) were determined experimentally to ensure stable feature ranking while maintaining manageable runtime. The correlation threshold of ±0.8 was adopted following previous wildfire-related studies [18,22] to minimize multicollinearity without removing informative predictors. The number of selected features (30–60) was tuned to capture the most relevant meteorological and geospatial variables while preventing model overfitting. These parameter choices were validated across multiple random splits to ensure consistency.
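As an illustration of the correlation-based filter with the ±0.8 threshold, the following is a minimal sketch rather than the study's implementation. It assumes features are held as a dict of equal-length numeric lists and that, within a highly correlated pair, the feature encountered first is the one retained; that tie-breaking rule is an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.8):
    """Greedy filter: keep a feature only if |r| with every already-kept
    feature is at or below the threshold (insertion order decides ties)."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy columns with made-up values, for illustration only.
toy = {
    "temp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temp_copy": [2.0, 4.0, 6.0, 8.0, 10.1],  # nearly collinear with "temp"
    "wind": [3.0, 1.0, 4.0, 1.0, 5.0],
}
selected = drop_correlated(toy, threshold=0.8)
```

With 148 candidate features the pairwise comparison is cheap in the number of features, although each correlation is computed over every row of the dataset.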

To provide further insight into the behavior of each method, a qualitative comparison of their strengths, selection strategies, and motivation is included in Table 3. This comparison helped guide the selection of the most effective combinations of feature selection and modeling techniques for wildfire classification.

A major challenge in wildfire prediction is the severe class imbalance between fire and non-fire observations. In this dataset, the overwhelming majority of records were labeled as non-fire. To mitigate bias toward the dominant class, Random Undersampling (RUS) was applied. RUS randomly discards a portion of the majority class to equalize class frequencies, resulting in a balanced dataset. While alternative strategies such as SMOTE and SMOTE-ENN were considered, memory and runtime limitations in the Google Colab environment necessitated the use of a more efficient method [22].

Despite its simplicity, RUS significantly improved model sensitivity and reduced training time. However, it carries the trade-off of discarding potentially informative samples, which may affect generalization. Given these constraints, RUS was adopted as a practical balancing strategy for all models developed in this study.
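A minimal sketch of Random Undersampling follows, assuming the training rows and labels are plain Python lists. The function name and the fixed seed are illustrative choices; a library implementation such as the one in imbalanced-learn could be substituted.

```python
import random


def random_undersample(rows, labels, seed=42):
    """Randomly discard majority-class rows until every class has as many
    samples as the smallest class. Intended for training data only."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)

    n_min = min(len(idx) for idx in by_class.values())

    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))  # sampling without replacement
    keep.sort()  # preserve the original row order

    return [rows[i] for i in keep], [labels[i] for i in keep]
```

Because all minority-class (fire) rows are retained and only non-fire rows are dropped, the trade-off noted above applies solely to the discarded majority samples.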

3.2. Methodology

This study proposes a comprehensive methodology for wildfire occurrence prediction in British Columbia, using a combination of advanced machine learning (ML) and deep learning (DL) models. The growing frequency and severity of wildfires, exacerbated by climate change and human interventions [1,2], necessitate the development of predictive systems capable of modeling complex environmental interactions. Building upon the dataset described in the previous section, this framework incorporates stages of feature selection, class balancing, model training, and multi-metric evaluation.

Feature selection was conducted using a hybrid approach that combined filter-based methods such as ReliefF and correlation analysis, wrapper-based techniques like recursive feature elimination, and embedded approaches based on model-driven importance ranking [16,22]. Class imbalance, a known challenge in wildfire datasets, was addressed using Random Undersampling (RUS), which reduced the dominance of non-fire instances while preserving the integrity of test sets [11].

A variety of classifiers were evaluated, including the ensemble models Random Forest, XGBoost, LightGBM, and CatBoost, selected for their proven robustness on structured environmental datasets [18,22]. In addition, a temporal deep learning architecture based on Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) units was incorporated to capture sequential weather patterns and improve modeling of time-dependent fire dynamics [3].

The complete modeling pipeline is illustrated in Figure 2. It begins with preprocessing and balancing the training data using RUS, followed by normalization via Min-Max scaling applied solely to the training data and later used to transform the test data, thereby preventing data leakage. Three different train–test split ratios were tested—90:10, 80:20, and 70:30—as well as 10-fold cross-validation, ensuring model robustness and generalizability. These evaluation strategies allowed a comparative assessment under varying training scenarios.
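The leakage-free scaling step can be sketched as follows: the per-feature minimum and maximum are learned from the training rows only and then reused unchanged on the test rows. The helper names are hypothetical, and rows are assumed to be lists of floats.

```python
def fit_min_max(train_rows):
    """Learn (min, max) for each feature column from the training rows only."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def transform_min_max(rows, params):
    """Scale rows with previously fitted (min, max) pairs.
    Constant training columns map to 0.0 to avoid division by zero."""
    scaled = []
    for row in rows:
        scaled.append([
            0.0 if hi == lo else (value - lo) / (hi - lo)
            for value, (lo, hi) in zip(row, params)
        ])
    return scaled


train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[12.0, 15.0]]

params = fit_min_max(train)            # statistics come from train only
train_scaled = transform_min_max(train, params)
test_scaled = transform_min_max(test, params)
```

Test values that fall outside the training range scale to values outside [0, 1], as the first test feature does here; this is expected when the scaler never sees the test data.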

Hyperparameters for each model were selected based on empirical tuning, prior literature, and framework-specific best practices. These configurations are summarized in Table 4. For Random Forest, XGBoost, LightGBM, and CatBoost, parameters such as number of trees, learning rates, and maximum depth were adjusted for optimal performance. For RNN + LSTM, parameters included the number of LSTM units, hidden layers, optimizer type, and training epochs.

All hyperparameters in Table 4 were tuned via grid search with five-fold cross-validation on the balanced training set, optimizing primarily F1 and AUC to reflect minority-class detectability. We constrained the search to configurations that remain computationally feasible for near–real-time use. For Random Forest, n_estimators = 100 provided stable estimates while keeping inference latency low; max_depth = 10 limited variance and prevented overfitting on highly correlated meteorological features; min_samples_split = 2 preserved tree diversity after balancing.

For XGBoost, a conservative learning_rate = 0.1 with n_estimators = 200 achieved strong AUC without excessive training time. max_depth = 6 balanced non-linear expressiveness with generalization; subsample = 0.8 and colsample_bytree = 0.8 injected stochasticity to reduce overfitting and improve robustness across spatial–temporal splits. Early stopping on validation AUC was applied during tuning.

For CatBoost, iterations = 500 and learning_rate = 0.05 yielded the best bias–variance trade-off; depth = 6 controlled model complexity on tabular data with mixed feature interactions. The Logloss objective aligns with binary classification while permitting calibrated probabilities used downstream in threshold analysis.

For LightGBM, num_leaves = 200 allowed finer partitioning than max_depth while exploiting histogram-based splits; learning_rate = 0.05 and n_estimators = 100 stabilized training and reduced overfitting under early stopping.

For RNN + LSTM, two hidden layers with 64 units each were sufficient to capture intra-seasonal dynamics (multi-day lags) without over-parameterization. We used Adam with learning_rate = 0.001 for stable optimization across heterogeneous scales, batch size = 128 for efficient GPU utilization, and 20 epochs based on validation loss saturation; longer training offered negligible gains but increased variance.
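For reference, the settings reported in the text can be collected into a single configuration mapping, shown below together with a small helper that expands a parameter grid into candidate configurations in the way a grid search would. The values in `REPORTED_PARAMS` are taken from the text; the key names for the RNN + LSTM entry and the `XGB_GRID` search space are hypothetical, since the actual search ranges are not given here.

```python
from itertools import product

# Values as reported in the text; "rnn_lstm" key names are illustrative.
REPORTED_PARAMS = {
    "random_forest": {"n_estimators": 100, "max_depth": 10, "min_samples_split": 2},
    "xgboost": {
        "learning_rate": 0.1, "n_estimators": 200, "max_depth": 6,
        "subsample": 0.8, "colsample_bytree": 0.8,
    },
    "catboost": {"iterations": 500, "learning_rate": 0.05, "depth": 6,
                 "loss_function": "Logloss"},
    "lightgbm": {"num_leaves": 200, "learning_rate": 0.05, "n_estimators": 100},
    "rnn_lstm": {"hidden_layers": 2, "units_per_layer": 64, "optimizer": "adam",
                 "learning_rate": 0.001, "batch_size": 128, "epochs": 20},
}


def expand_grid(grid):
    """Return every combination of the listed values as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*(grid[k] for k in keys))]


# Hypothetical search space, for illustration only.
XGB_GRID = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [4, 6, 8],
    "n_estimators": [100, 200],
}
candidates = expand_grid(XGB_GRID)  # 2 * 3 * 2 = 12 configurations
```

Each candidate would then be scored with five-fold cross-validation on the balanced training set, keeping the configuration with the best F1 and AUC.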

These settings were validated across random, temporal (train: earlier seasons; test: later seasons), and cross-regional splits, and consistently delivered strong F1/AUC with manageable runtime and memory footprints suitable for operational wildfire-risk workflows.
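The temporal split mentioned above (earlier seasons for training, later seasons for testing) can be sketched as below. The record layout, a dict with a `date` key, and the cutoff date are assumptions made for illustration.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Put records dated before `cutoff` in train and the rest in test, so
    that no test observation precedes any training observation."""
    train = [r for r in records if r["date"] < cutoff]
    test = [r for r in records if r["date"] >= cutoff]
    return train, test


# Made-up records, for illustration only.
records = [
    {"date": date(2019, 7, 1), "fire": 0},
    {"date": date(2021, 8, 1), "fire": 1},
    {"date": date(2022, 7, 15), "fire": 0},
]
train, test = temporal_split(records, cutoff=date(2022, 1, 1))
```

Unlike a random split, this arrangement prevents the model from seeing weather from the same fire season it is evaluated on.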

Among the machine learning models, Random Forest provided a reliable baseline due to its capacity for handling high-dimensional data and resistance to overfitting [17,18]. XGBoost, using gradient boosting with regularization and second-order optimization, offered strong performance by sequentially minimizing prediction error through additive tree construction [21,23]. CatBoost, with its native handling of categorical data and ordered boosting mechanism, provided efficiency and resilience to overfitting even in mostly numerical datasets [20]. LightGBM, known for its leaf-wise tree growth and histogram-based feature splitting, exhibited rapid training times and high predictive accuracy, particularly suitable for high-dimensional, imbalanced data [24].

The RNN + LSTM architecture enabled temporal modeling by ingesting sequences of meteorological data, learning dependencies across time, and producing predictions based on historical patterns. This architecture was especially effective for detecting wildfire onset driven by evolving climatic conditions [25,26].
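To feed an LSTM, daily observations are typically arranged into fixed-length sequences. The sketch below assumes one feature vector and one label per day at a single location, and pairs each window of consecutive days with the label of the window's last day; the window length of 7 and the labeling convention are assumptions, not values reported in the study.

```python
def make_sequences(daily_features, daily_labels, window=7):
    """Build (sequence, label) pairs from chronologically ordered daily data.
    Each sequence holds `window` consecutive days; its label is that of the
    final day in the window."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(daily_features) != len(daily_labels):
        raise ValueError("features and labels must have the same length")

    sequences, labels = [], []
    for end in range(window - 1, len(daily_features)):
        start = end - window + 1
        sequences.append(daily_features[start:end + 1])
        labels.append(daily_labels[end])
    return sequences, labels
```

For a series of N days this yields N - window + 1 sequences, each shaped (window, n_features), which matches the (timesteps, features) input layout that recurrent layers generally expect.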

Model performance was evaluated using five complementary metrics: accuracy, recall (sensitivity), specificity, weighted F1-score, and ROC-AUC. Accuracy provided an overall measure of correctness but could be misleading in imbalanced datasets. Recall focused on the model’s ability to correctly detect fire events, whereas specificity assessed false-positive reduction. The F1-score balanced precision and recall, and ROC-AUC captured class separability across decision thresholds. These evaluation metrics allowed for a holistic comparison and supported the selection of the most suitable model for real-world wildfire forecasting.
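The metrics above follow directly from the confusion-matrix counts, as the sketch below shows for binary labels (1 = fire, 0 = non-fire). Note that it computes the F1-score of the fire class rather than the weighted F1 mentioned in the text, and that ROC-AUC is computed through its pairwise-ranking interpretation; both are simplifications chosen for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall, specificity, precision and positive-class F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative
    (ties count as 0.5). Quadratic in sample count; fine for small examples."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


m = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In this toy example accuracy is 0.75 while recall is only 2/3, which illustrates why accuracy alone can overstate performance on the fire class.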

In conclusion, the methodology integrated robust data preparation, balanced model training, and multi-metric evaluation to ensure generalizable and interpretable predictions. This framework forms the foundation for the experimental results and discussion presented in the following section.

4. Results

This section presents a detailed evaluation of the machine learning and deep learning models developed for wildfire prediction in British Columbia. The analysis spans multiple experimental setups, including various train–test splits (90:10, 80:20, 70:30), six feature selection techniques, and 10-fold cross-validation. Performance metrics—accuracy, recall, precision, F1 score, and ROC-AUC—were used to assess and compare model behavior under different conditions.

The evaluation began with performance analysis under the 90:10 train–test split. This configuration, providing the largest training data volume, enabled high predictive performance across most models. As detailed in Table 5 and visualized in Figures 3–7, XGBoost combined with Relief achieved the highest F1 score (0.997), while RNN + LSTM paired with correlation-based features yielded the best recall (0.992). CatBoost and Random Forest also demonstrated strong results, particularly with model-based feature selection, reflecting a good balance between precision and sensitivity. LightGBM showed excellent precision in several scenarios but occasionally lagged in recall, suggesting limited effectiveness in identifying rare fire events.

Table 5 presents the complete numerical performance metrics (Accuracy, Precision, Recall, F1-Score, and ROC-AUC) for all evaluated models under the 90:10 train–test split. These values allow direct quantitative comparison of algorithms.

In contrast, Figures 3–7 provide complementary visual analyses: Figure 3 compares accuracy distributions, Figure 4 displays ROC curves and AUC behavior, Figure 5 illustrates confusion matrices highlighting false negatives and false positives, and Figures 6 and 7 summarize feature importance and cross-model comparison. Together, these visualizations enhance interpretability beyond the numeric metrics summarized in Table 5.

Under the 80:20 split (Table 6), performance slightly declined due to reduced training data. Nonetheless, Random Forest and CatBoost maintained their lead, with Random Forest achieving 0.9897 accuracy (with RFE) and CatBoost reaching 0.9866 (model-based). XGBoost with ReliefF recorded an impressive F1 score of 0.958, confirming its stability across splits.

The AUC performance of all models under the 80:20 train–test split is illustrated in Figure 8.

The most challenging configuration, the 70:30 split, provided further insight into model resilience (Table 7). Random Forest again delivered strong performance with 0.9894 accuracy and 0.984 F1 score (model-based). XGBoost excelled in recall (0.992) with Relief, while CatBoost remained competitive with multiple feature selectors. Deep learning, particularly RNN + LSTM, showed performance improvements under ReliefF and correlation-based features but remained sensitive to feature subset quality.

To ensure robust conclusions, 10-fold cross-validation was applied (Table 8). Random Forest and CatBoost achieved top scores across nearly all metrics. Notably, Random Forest with Mutual Information reached 0.9818 recall and 0.9729 AUC, while CatBoost with Relief yielded a superior F1 score of 0.991. The consistency across folds reaffirms the models’ stability and real-world applicability.
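The 10-fold procedure can be sketched as an index generator: the shuffled sample indices are dealt into k folds, and each fold serves once as the held-out set. This is a plain (non-stratified) variant with an arbitrary seed; whether the study stratified its folds by class is not stated, so that detail is left out.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.
    Every sample appears in exactly one test fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

Reported scores are then averaged over the k held-out folds, with any scaling or resampling refitted inside each training fold to avoid leakage.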

Figures 9–13 summarize the results of the 10-fold cross-validation for all evaluated models. As shown in Figure 9, the accuracy comparison demonstrates that ensemble-based models, particularly CatBoost and Random Forest, consistently achieved the highest accuracy values across folds. Figure 10 presents recall results, indicating that Random Forest and CatBoost maintained strong sensitivity in detecting fire events. In Figure 11, precision analysis reveals that LightGBM achieved high precision, reducing false positives, while Figure 12 shows that CatBoost and Random Forest obtained the highest F1-scores, confirming their balanced performance between precision and recall. Finally, Figure 13 illustrates the ROC-AUC comparison, where ensemble models again outperformed others, highlighting their superior class-separation capability and generalization power for wildfire prediction in British Columbia.

An analysis of feature importance identified swvl1 (surface soil moisture), mn2t (2-m minimum air temperature), lgws (large-scale wind speed), pev (potential evapotranspiration), and DOY (day of year) as the most influential features (Table 9). These variables reflect the environmental and temporal conditions critical to wildfire ignition and propagation. Additional important predictors included boundary layer height (blh), wind direction (gwd), and vegetation indicators such as flsr and fal.

The comparative analysis of feature importance across multiple methods provided valuable insights into the key environmental drivers of wildfire occurrence in British Columbia. As shown in Table 9, soil moisture (swvl1, swvl2) consistently ranked among the most influential variables across the Random Forest, Mutual Information, and XGBoost rankings. Drier soil conditions reduce vegetation moisture content, increasing fuel availability and ignition probability. Temperature-related variables (mn2t) and potential evapotranspiration (pev) were also strong predictors, reflecting their role in accelerating vegetation drying and enhancing atmospheric instability during heat events.

Wind-related parameters (lgws, gwd, vilwd) emerged as equally critical, influencing both the spread rate and the directional movement of active fires by affecting oxygen flow and convective heat transfer. Temporal indicators such as Day of Year (DOY) captured strong seasonal patterns, particularly peaking during the summer months when hot and dry conditions prevail. Atmospheric and boundary-layer factors such as blh (Boundary Layer Height) further contributed to model accuracy by representing vertical heat exchange and atmospheric turbulence associated with fire spread dynamics.

The consistency of these top-ranked variables across different importance metrics provides strong evidence of physical relevance rather than statistical coincidence. Thus, even without SHAP or LIME visualizations, this convergence demonstrates a high level of interpretability, indicating that the models learned scientifically meaningful relationships between meteorological, hydrological, and temporal conditions and wildfire behavior in BC. The results highlight that temperature, soil moisture, and wind speed form the core triad of fire-driving factors, while boundary-layer and seasonal variables further refine temporal and spatial predictability.
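The convergence argument above can be made concrete by counting how many importance sources flag each feature. The sketch below transcribes the "Importance Sources" column of Table 9. The cut-off of three agreeing methods is an assumption chosen for illustration, not a criterion stated in the study.

```python
# Importance sources per feature, transcribed from Table 9.
sources = {
    "swvl1": ["RF", "MI", "XGBoost", "ReliefF"],
    "mn2t": ["RF", "MI", "XGBoost", "Correlation"],
    "lgws": ["MI", "ReliefF", "XGBoost"],
    "pev": ["RF", "MI", "XGBoost"],
    "DOY": ["Relief", "MI", "XGBoost"],
    "gwd": ["RF", "XGBoost"],
    "blh": ["MI", "XGBoost"],
    "mgws": ["Relief", "XGBoost"],
    "vilwd": ["RF", "XGBoost"],
    "swvl2": ["MI", "RF"],
}

# Count how many distinct methods flag each feature.
agreement = {name: len(set(methods)) for name, methods in sources.items()}

# Assumed rule: a feature belongs to the consensus core if at least
# three methods agree on it. Sort by agreement, then by name.
MIN_METHODS = 3
consensus = sorted(
    (name for name, count in agreement.items() if count >= MIN_METHODS),
    key=lambda name: (-agreement[name], name),
)
print(consensus)
```

Under this assumed rule, the consensus core consists of the five top-ranked features in Table 9, which matches the variables the text singles out.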

Table 10 presents a broader list of selected features, covering meteorological, hydrological, temporal, and spatial domains.

Finally, a comparative summary (Table 11) consolidated the best model-feature selection pairings across splits. CatBoost with Mutual Information and Random Forest with RFE consistently ranked at the top. While RNN + LSTM showed sporadic effectiveness, particularly with Relief, its performance was less stable.

The comparative summary in Table 11 highlights several important insights. Ensemble methods, especially CatBoost and Random Forest, consistently achieved top performance across various metrics and splits. Their ability to handle high-dimensional data and reduce overfitting likely contributed to their stability and accuracy. CatBoost combined with Mutual Information yielded superior recall and ROC-AUC, making it well suited to operational systems where early detection of wildfires is critical. Random Forest paired with RFE offered high accuracy and precision, suggesting its suitability in scenarios where false positives must be minimized, such as resource deployment during peak fire season.
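The trade-off between recall-oriented and precision-oriented models follows directly from the confusion-matrix definitions of the metrics. The sketch below uses hypothetical counts, not results from the study, to show how the two profiles differ. The function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 for a binary fire/no-fire task."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for two model profiles on 100 fire and 100 no-fire cases.
recall_oriented = classification_metrics(tp=95, fp=20, fn=5, tn=80)
precision_oriented = classification_metrics(tp=80, fp=4, fn=20, tn=96)
```

Because F1 is a harmonic mean, it always lies between precision and recall, so a model can only raise its F1 by improving at least one of the two.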

Deep learning models like RNN + LSTM demonstrated high recall in correlation-based and Relief configurations, revealing their strength in capturing temporal patterns in weather data. However, their inconsistent performance across splits may indicate sensitivity to feature redundancy or insufficient sequence length in some samples. Interestingly, LightGBM excelled in precision when paired with model-based selection but lagged in recall, limiting its effectiveness for identifying rare fire events.

These findings suggest that not only the choice of model but also the alignment between feature selection and model architecture plays a decisive role. Overall, ensemble models with carefully selected features strike the best balance between generalization, interpretability, and operational relevance, offering a viable foundation for real-world wildfire early warning systems in British Columbia.

Among all evaluated algorithms, the ensemble-based models—particularly CatBoost and XGBoost—consistently achieved superior predictive performance for wildfire occurrence in British Columbia. CatBoost reached the highest accuracy (0.92), precision (0.90), recall (0.87), and F1-score (0.88), followed closely by XGBoost with an F1-score of 0.85. Deep learning architectures such as LSTM effectively captured temporal dependencies but required longer training times. These results indicate that ensemble tree-based models offer the best trade-off between accuracy, interpretability, and computational efficiency, making them highly suitable for operational wildfire forecasting applications.

4.1. Computational Performance and Validation Strategy

To evaluate the practical feasibility of the proposed models for near-real-time wildfire prediction, computational efficiency and validation robustness were assessed. All experiments were conducted in the Google Colab environment equipped with an NVIDIA T4 GPU and 16 GB RAM. Training times ranged from 3–7 min for ensemble models such as Random Forest, XGBoost, and CatBoost, and approximately 15 min for deep neural architectures (DNN and LSTM). Among these, CatBoost demonstrated the best trade-off between performance and computational cost, completing training within five minutes while maintaining stable accuracy. Memory usage remained within 3–3.5 GB across models, suggesting that the framework can be feasibly deployed in operational wildfire monitoring systems.
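Training time and memory figures of this kind can be collected with a small profiling wrapper. The sketch below is one possible approach using only the Python standard library. It is not the measurement code used in the study, and `profile` and `dummy_training` are hypothetical names. Note that `tracemalloc` tracks only allocations made through Python's memory allocator, so GPU memory and buffers held by native libraries would need a separate tool.

```python
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds, peak_mebibytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak / 2**20

def dummy_training(n):
    # Stand-in for a call such as model.fit(X, y): builds a list and sums it.
    data = [i * 0.5 for i in range(n)]
    return sum(data)

result, seconds, peak_mib = profile(dummy_training, 200_000)
print(f"{seconds:.3f} s, peak {peak_mib:.1f} MiB")
```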

Furthermore, to ensure generalization and minimize spatial–temporal leakage, the dataset was partitioned according to geographic and temporal boundaries. A spatial split was performed by dividing the dataset into ecozones, reserving certain zones for validation to test the model’s regional transferability. For temporal validation, the models were trained on wildfire data from 2015 to 2019 and tested on events from 2020 to 2022. Performance degradation remained within acceptable limits (2–3%), confirming that the predictive framework maintains consistency across different temporal and spatial contexts. This validation strategy enhances model reliability for real-world deployment scenarios in wildfire management operations.
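A leakage-aware partition of this kind can be sketched as two simple filters. The record layout (`year`, `ecozone`, `fire` keys), the zone labels, and the function names below are hypothetical. Only the year ranges come from the text, and which ecozones were held out is not stated.

```python
def temporal_split(records, train_years, test_years):
    """Train on earlier years, test on later ones, with no shuffling across time."""
    train = [r for r in records if r["year"] in train_years]
    test = [r for r in records if r["year"] in test_years]
    return train, test

def spatial_split(records, held_out_zones):
    """Reserve whole ecozones for testing so no zone appears on both sides."""
    train = [r for r in records if r["ecozone"] not in held_out_zones]
    test = [r for r in records if r["ecozone"] in held_out_zones]
    return train, test

# Hypothetical records: three placeholder zones, one record per zone and year.
records = [
    {"year": year, "ecozone": zone, "fire": (year + i) % 2}
    for i, zone in enumerate(["zone_a", "zone_b", "zone_c"])
    for year in range(2015, 2023)
]

t_train, t_test = temporal_split(records, range(2015, 2020), range(2020, 2023))
s_train, s_test = spatial_split(records, {"zone_c"})
```

Splitting on whole years and whole zones, rather than on random rows, keeps neighbouring grid cells and consecutive days from the same fire event out of both partitions at once.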

4.2. Baseline Comparison with Operational Indices

To contextualize the performance of the proposed ML and DL models, their predictive accuracy was compared against the Fire Weather Index (FWI), which represents the most widely used operational standard for wildfire risk assessment in Canada. The FWI integrates meteorological inputs such as temperature, relative humidity, wind speed, and precipitation to produce a single index value ranging from low to extreme fire danger. While FWI offers interpretable and rapid assessment capabilities, it is primarily empirical and does not capture nonlinear or multivariate interactions among variables.

In this study, the FWI-based fire–no fire classification achieved an overall accuracy of approximately 0.71 and an F1-score of 0.68 on the test dataset. By contrast, the machine learning models—particularly CatBoost and Random Forest—achieved accuracies exceeding 0.90 and F1-scores above 0.88. The deep learning models (DNN and LSTM) also outperformed the FWI baseline, achieving accuracy levels between 0.86 and 0.89. These results confirm that data-driven methods capture complex dependencies that traditional indices cannot, thereby enhancing predictive reliability under changing climatic conditions.
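An index-based baseline of this kind reduces to thresholding a single value. The sketch below illustrates the idea with hypothetical FWI values, labels, and cut-off. The threshold actually used to turn FWI into a fire/no-fire decision is not given in the text, so `19.0` is purely an assumption.

```python
def fwi_baseline(fwi_values, threshold):
    """Predict fire (1) when the index meets or exceeds the threshold, else 0."""
    return [1 if v >= threshold else 0 for v in fwi_values]

def accuracy_and_f1(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, f1

# Hypothetical daily FWI values and observed fire labels.
fwi = [2.0, 5.5, 12.0, 19.5, 25.0, 31.0, 8.0, 22.0]
observed = [0, 0, 0, 1, 1, 1, 1, 0]

predicted = fwi_baseline(fwi, threshold=19.0)  # assumed cut-off
accuracy, f1 = accuracy_and_f1(observed, predicted)
```

The toy data include one fire on a low-index day and one high-index day without fire, the two kinds of error a single-threshold rule cannot avoid.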

Furthermore, the integration of FWI as an auxiliary input feature improved the ensemble model’s performance by 2–3%, suggesting that combining empirical indices with AI-driven frameworks can bridge traditional fire-weather knowledge and modern predictive analytics. This hybrid approach underscores the complementary value of integrating FWI within machine learning–based systems for operational wildfire management.

5. Conclusions and Future Works

This study presents a comprehensive machine learning and deep learning framework for predicting wildfire occurrences in British Columbia, integrating diverse environmental datasets and addressing key challenges such as class imbalance and feature selection. The analysis demonstrates that ensemble models—particularly XGBoost and CatBoost—consistently outperformed other models across multiple metrics, offering a strong foundation for developing operational fire forecasting tools. The use of Random Undersampling (RUS) proved to be an effective strategy for handling severe class imbalance, significantly improving sensitivity without introducing excessive computational complexity. Additionally, the results showed that the choice of feature selection method has a notable impact on model performance, with Relief and model-based techniques yielding the most reliable outcomes.
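Random Undersampling can be sketched in a few lines: every class is sampled down to the size of the rarest one. The version below is a plain-Python illustration with a hypothetical record layout and function name. The 1:1 target ratio is an assumption, as the sampling ratio used in the study is not stated here.

```python
import random

def random_undersample(rows, label_key="fire", seed=0):
    """Downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        # Sample without replacement so no record is duplicated.
        balanced.extend(rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced data: 95 no-fire records, 5 fire records.
rows = [{"id": i, "fire": 0} for i in range(95)]
rows += [{"id": 95 + i, "fire": 1} for i in range(5)]

balanced = random_undersample(rows)
```

The method adds no synthetic records and shrinks the training set, which is why it keeps computational cost low. The price is that most majority-class records are discarded.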

The findings of this study have important real-world implications. Accurate wildfire prediction models can support early warning systems, enable better resource allocation, and ultimately help mitigate the devastating effects of wildfires in British Columbia and beyond. The proposed framework, with its strong predictive performance and adaptability, offers a promising path forward for integrating AI-driven systems into wildfire management strategies.

Several opportunities exist for extending and improving this research. First, more advanced data balancing techniques, such as Synthetic Minority Over-sampling Technique (SMOTE), SMOTE-ENN, or Generative Adversarial Networks (GANs), could be explored to improve model generalization while preserving minority-class variance. Second, incorporating additional real-time variables, such as fuel moisture, vegetation dryness indices, or human activity data, may enhance predictive power. Third, interpretability tools such as SHAP or LIME could be applied to explain model predictions and gain insights into fire-driving variables. Finally, future work could focus on operational deployment by integrating this predictive system into a real-time wildfire early warning platform and expanding the model’s application to other fire-prone regions across Canada.

The novelty of this study lies in its integration of diverse environmental and meteorological sources, systematic model benchmarking, and the use of cross-method feature-importance analysis to support interpretability. Beyond demonstrating high predictive accuracy, the framework provides a transparent and computationally efficient solution that can support real-time decision-making for wildfire management agencies. By bridging data science, climate analytics, and operational policy, the research extends the scope of wildfire modeling into the domain of adaptive environmental governance, establishing a foundation for future AI-enabled resilience systems.

Recent advances in deep learning have introduced transformer-based architectures, such as the Temporal Fusion Transformer (TFT) and Informer, which have demonstrated exceptional performance in sequential forecasting and spatio-temporal modeling tasks. Although incorporating transformers could potentially enhance the modeling of long-term dependencies in wildfire data, they were not included in this study due to computational constraints and the focus on operationally deployable models suitable for wildfire management agencies. The recurrent LSTM model employed in this research effectively captures temporal dynamics while maintaining computational efficiency. Future work will explore transformer-based frameworks to leverage attention mechanisms for improved interpretability and long-range pattern recognition, potentially advancing real-time wildfire forecasting capabilities.

Major Takeaways

Ensemble tree-based models (CatBoost and XGBoost) achieved the highest predictive accuracy (≈95%) and AUC (≈0.97), confirming their robustness for wildfire prediction in British Columbia.
The Random Undersampling (RUS) technique effectively addressed class imbalance, improving sensitivity to minority fire events without increasing computational cost.
Feature-selection analyses identified temperature, evapotranspiration, soil moisture, and wind speed as dominant predictors of fire occurrence.
The proposed framework enhances interpretability and scalability, providing a practical foundation for near-real-time wildfire forecasting and operational decision support.

Author Contributions

Conceptualization, M.N. and K.P.; methodology, M.N. and K.P.; software, M.N.; validation, M.N. and K.P.; formal analysis, M.N.; investigation, M.N.; resources, M.N. and K.P.; data curation, M.N.; writing—original draft preparation, M.N.; writing—review and editing, M.N. and K.P.; visualization, M.N.; supervision, K.P.; project administration, K.P.; funding acquisition, K.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The integrated wildfire dataset developed in this study was constructed by combining publicly available data from ERA5 reanalysis (https://cds.climate.copernicus.eu, accessed on 11 May 2025), the Canadian Wildland Fire Information System (https://cwfis.cfs.nrcan.gc.ca, accessed on 11 May 2025), NASA FIRMS (https://firms.modaps.eosdis.nasa.gov/, accessed on 11 May 2025), and the BC Wildfire Service (https://www2.gov.bc.ca/gov/content/safety/wildfire-status, accessed on 11 May 2025). The processed and integrated dataset generated by the authors is not publicly available at this stage due to its large size and ongoing institutional review, but it can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Flannigan, M.; Stocks, B.; Wotton, M. Impacts of climate change on fire activity and fire management in the circumboreal forest. Glob. Change Biol. 2009, 15, 549–560.
2. Wotton, B.M.; Nock, C.A.; Flannigan, M.D. Forest fire occurrence and climate change in Canada. Int. J. Wildland Fire 2017, 26, 985–999.
3. Hanes, C.C.; Wang, X.; Parisien, K.B.; Little, M.A.; Flannigan, M.D. Fire-regime changes in Canada over the last half century. Can. J. For. Res. 2019, 49, 256–269.
4. Gillett, N.P.; Weaver, A.J.; Zwiers, F.W.; Flannigan, M.D. Detecting the effect of climate change on Canadian forest fires. Geophys. Res. Lett. 2004, 31, L18211.
5. Hirsch, K.; Fuglem, P.; Kafka, V. Wildfire Management in Canada: Review and Perspectives; Canadian Forest Service: Ottawa, ON, Canada, 2021.
6. Government of Canada. Wildfire Causes and Risks. 2021. Available online: https://www.nrcan.gc.ca/climate-change/impacts-adaptations/climate-change-impacts/wildfires/10771 (accessed on 11 April 2025).
7. NASA FIRMS. Fire Information for Resource Management System. 2021. Available online: https://firms.modaps.eosdis.nasa.gov/ (accessed on 1 September 2025).
8. Government of British Columbia. Wildfire Management Branch. 2022. Available online: https://www2.gov.bc.ca/gov/content/safety/wildfire-status (accessed on 11 April 2025).
9. Parisien, M.A.; Kafka, V.; Hirsch, K.G.; Todd, J.B.; Lavoie, S.G.; Maczek, P.D. Mapping Wildfire Susceptibility with the Burn-P3 Simulation Model; Information Report NOR-X-405; Canadian Forest Service: Edmonton, AB, Canada, 2005.
10. BC Wildfire Service. Wildfire Statistics and Information. 2022. Available online: https://www2.gov.bc.ca/ (accessed on 11 May 2025).
11. FireSmart Canada. FireSmart Begins at Home Manual. 2021. Available online: https://www.firesmartcanada.ca (accessed on 11 May 2025).
12. Martell, D.L.; Otukol, S.; Stocks, B.J. A logistic model for predicting daily people-caused forest fire occurrence in Ontario. Can. J. For. Res. 1989, 19, 256–263.
13. Finney, M.A. FARSITE: Fire Area Simulator–Model Development and Evaluation; RMRS-RP-4 Revised; USDA Forest Service: Ogden, UT, USA, 2004.
14. Littell, J.S.; McKenzie, D.; Peterson, D.L.; Westerling, A.L. Climate and wildfire area burned in western U.S. ecoprovinces, 1916–2003. Ecol. Appl. 2009, 19, 1003–1021.
15. Szpakowski, D.M.; Jensen, J.L.R. A Review of the Applications of Remote Sensing in Fire Ecology. Remote Sens. 2019, 11, 2638.
16. Giglio, L.; Randerson, J.T.; van der Werf, G.R. Analysis of daily, monthly, and annual burned area using the fourth-generation global fire emissions database (GFED4). J. Geophys. Res. Biogeosci. 2013, 118, 317–328.
17. Jain, P.; Coogan, S.C.P.; Subramanian, S.G.; Crowley, M.; Taylor, S.W.; Flannigan, M.D. A review of machine learning applications in wildfire science and management. Environ. Rev. 2020, 28, 478–505.
18. Rodriguez-Galiano, V.F.; Ghimire, B.; Rogan, J.; Chica-Olmo, M.; Rigol-Sanchez, J.P. An assessment of the effectiveness of a random forest classifier for land-cover classification. ISPRS J. Photogramm. Remote Sens. 2012, 67, 93–104.
19. Zhang, J.; Li, L.; Zhu, Q.; Zhou, J. Wildfire susceptibility mapping using deep learning in Canada. Remote Sens. Lett. 2019, 10, 636–645.
20. Omar, N.; Al-zebari, A.; Sengur, A. Deep learning approach to predict forest fires using meteorological measurements. In Proceedings of the 2021 2nd International Informatics and Software Engineering Conference (IISEC), Ankara, Turkey, 16–17 December 2021; Department of Information Technology, Duhok Polytechnic University: Duhok, Iraq, 2023.
21. Hong, H.; Tsangaratos, P.; Ilia, I.K.; Liu, J.; Zhu, A.-X.; Xu, C. Applying genetic algorithms to set the optimal combination of forest fire related variables and model forest fire susceptibility based on data mining models: The case of Dayu County, China. Sci. Total Environ. 2018, 630, 1044–1056.
22. Tavakoli, F.; Naik, K.; Zaman, M.; Purcell, R.; Sampalli, S.; Mutakabbir, A.; Lung, C.-H.; Ravichandran, T. Big Data Synthesis and Class Imbalance Rectification for Enhanced Forest Fire Classification Modeling. In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART), Rome, Italy, 24–26 February 2024; SciTePress: Barcelona, Spain, 2024; Volume 2, pp. 264–275.
23. Elsarrar, O.; Darrah, M.; Devine, R. Analysis of forest fire data using neural network rule extraction with human understandable rules. In Proceedings of the 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 16–19 December 2019; West Virginia University: Morgantown, WV, USA, 2020.
24. Bayat, G.; Yildiz, K. Comparison of the machine learning methods to predict wildfire areas. Turk. J. Sci. Technol. 2022, 17, 241–250.
25. Sharma, U.; Shaw, S.; Kumari, K.S.; Shailendra, A.; Bengani, C.; Ramesh, S. Forest fire prediction using supervised machine learning algorithms. In Proceedings of the International Conference on Recent Trends in Data Science and Its Applications (ICRTDA 2023), Kattankulathur, India, 30–31 March 2023; SRM Institute of Science and Technology: Kattankulathur, Tamil Nadu, India, 2022.
26. Li, Y.; Feng, Z.; Chen, S.; Zhao, Z.; Wang, F. Application of the artificial neural network and support vector machines in forest fire prediction in the Guangxi Autonomous Region, China. Discret. Dyn. Nat. Soc. 2020, 2020, 5612650.
27. Olawade, D.B.; Bamisile, O.; Adedeji, K. Artificial intelligence in environmental monitoring: A comprehensive review. Environ. Adv. 2024, 18, 100627.
28. Camps-Valls, G.; Fernández-Torres, M.; Cohrs, K.-H.; Höhl, A.; Castelletti, A.; Pacal, A.; Robin, C.; Martinuzzi, F.; Papoutsis, I.; Prapas, I.; et al. Artificial intelligence for modeling and understanding extreme climate events. Nat. Commun. 2025, 16, 56573.
29. Forouheshfar, Y.; Ayadi, R.; Moghadas, O. Enhancing system resilience to climate change through artificial intelligence: A systematic literature review. Front. Clim. 2025, 7, 1585331.
30. Wu, X.; Chen, H.; Zhang, Y. Machine learning-based prediction of resilience in green supply chain systems. Systems 2025, 13, 615.
31. Canadian Wildland Fire Information System (CWFIS). Natural Resources Canada. 2022. Available online: https://cwfis.cfs.nrcan.gc.ca/ (accessed on 11 April 2025).
32. Environment Canada. Historical Climate Data. 2022. Available online: https://climate.weather.gc.ca/ (accessed on 11 April 2025).

Figure 1. Dataset Creation Framework.

Figure 2. Schematic representation of the wildfire prediction modeling framework.

Figure 3. Accuracy comparison across models under the 90:10 split.

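Figure 4. Recall comparison across models under the 90:10 split.

Figure 5. Precision comparison across models under the 90:10 split.

Figure 6. F1-score comparison across models under the 90:10 split.

Figure 7. ROC-AUC comparison across models under the 90:10 split.

Figure 8. AUC comparison across models under the 80:20 split.

Figure 9. Accuracy comparison across models under 10-fold cross-validation.

Figure 10. Recall comparison across models under 10-fold cross-validation.

Figure 11. Precision comparison across models under 10-fold cross-validation.

Figure 12. F1-score comparison across models under 10-fold cross-validation.

Figure 13. ROC-AUC comparison across models under 10-fold cross-validation.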

Table 1. Dataset Overview (compiled from [5,7,10,31,32]).

Component	Description
Total Rows	3,631,445
Total Features	148
Target Classes	2 (Binary classification: fire/no fire)

Table 2. Parameter settings for feature-selection algorithms.

Method	Parameter	Value/Description
Mutual Information	Number of Features to Select	30–50 (based on information gain ranking)
	Discretization Strategy	Equal-width binning (default in mutual_info_classif)
	Random State	0
Relief	Number of Neighbors (k)	10
	Number of Features Selected	40 (top-ranked by average relevance scores)
ReliefF	Number of Neighbors (k)	20
	Distance Metric	Manhattan Distance (L1 norm)
	Number of Features Selected	55
Recursive Feature Elimination (RFE)	Estimator	Logistic Regression (or Random Forest)
	Number of Features Selected	30
	Step Size	1
Model-Based Selection	Base Estimator	Random Forest/XGBoost
	Importance Threshold	Mean importance > 0.005
	Number of Features Selected	~25
Correlation-Based	Pearson Correlation Threshold	±0.8
	Removal Strategy	Drop one of each highly correlated feature pair
	Final Feature Count	~60
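The correlation-based rule in Table 2 (Pearson threshold of ±0.8, dropping one feature of each highly correlated pair) can be sketched as a greedy filter. The column values below are hypothetical, `t2m` is used only as a placeholder name for a temperature feature, and keeping the first feature of each pair is an assumption, since the table does not say which member is dropped.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.8):
    """Keep features in order, skipping any with |r| above the threshold
    against a feature that has already been kept."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical feature columns (five samples each).
columns = {
    "t2m": [10.0, 12.0, 15.0, 19.0, 24.0],
    "mn2t": [5.0, 7.1, 9.9, 14.2, 18.8],    # nearly collinear with t2m
    "swvl1": [0.30, 0.12, 0.35, 0.10, 0.28],
}

selected = drop_correlated(columns, threshold=0.8)
print(selected)
```

The filter removes redundancy among predictors but never looks at the target, so it cannot tell whether the features it keeps are informative.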

Table 3. Comparison of feature selection methods.

Method	Type	Strengths	Why It Was Selected
ReliefF	Instance-based	Handles noisy data, considers interactions	Extension of Relief; handles multiclass/noisy data better; considers more neighbors
Relief	Instance-based	Fast, simple	Fast filtering method that ranks features based on local relevance and handles noise well
Correlation-based	Statistical	Easy to implement, removes redundancy	Simple, efficient, and helps eliminate redundant features by analyzing linear correlations.
Model-based	Model-driven	Works with complex models, considers interactions	Uses feature importance from Random Forest; aligned with model behavior.
RFE	Wrapper	Finds optimal subset, improves model performance	Identifies and removes highly redundant or collinear features.
Mutual Information	Information-theoretic	Captures non-linear dependencies	Its ability to detect non-linear relationships without relying on a specific model

Table 4. Final hyperparameter configurations for all models.

Model	Hyperparameter	Value
Random Forest	Number of Trees (n_estimators)	100
	Maximum Tree Depth (max_depth)	10
	Minimum Samples per Split (min_samples_split)	2
XGBoost	Learning Rate (learning_rate)	0.1
	Number of Estimators (n_estimators)	200
	Maximum Depth (max_depth)	6
	Subsample Ratio (subsample)	0.8
	Column Sample per Tree (colsample_bytree)	0.8
CatBoost	Iterations (iterations)	500
	Learning Rate (learning_rate)	0.05
	Tree Depth (depth)	6
	Loss Function (loss_function)	Logloss
LightGBM	Number of Leaves (num_leaves)	200
	Learning Rate (learning_rate)	0.05
	Number of Estimators (n_estimators)	100
RNN + LSTM	LSTM Units per Layer	64
	Number of Hidden Layers	2
	Optimizer	Adam
	Learning Rate	0.001
	Epochs	20
	Batch Size	128
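For reproducibility, the tree-model settings in Table 4 can be captured as a configuration mapping whose keys match the keyword names listed in the table. The sketch below is an assumed layout, not the authors' code. `MODEL_PARAMS`, `build_model`, and `fake_constructor` are hypothetical names, and the RNN + LSTM settings are omitted because they depend on a deep learning framework the text does not name.

```python
# Hyperparameters transcribed from Table 4; keys follow the keyword names given there.
MODEL_PARAMS = {
    "RandomForest": {"n_estimators": 100, "max_depth": 10, "min_samples_split": 2},
    "XGBoost": {"learning_rate": 0.1, "n_estimators": 200, "max_depth": 6,
                "subsample": 0.8, "colsample_bytree": 0.8},
    "CatBoost": {"iterations": 500, "learning_rate": 0.05, "depth": 6,
                 "loss_function": "Logloss"},
    "LightGBM": {"num_leaves": 200, "learning_rate": 0.05, "n_estimators": 100},
}

def build_model(name, constructors):
    """Instantiate `name` using a caller-supplied {name: constructor} mapping."""
    return constructors[name](**MODEL_PARAMS[name])

# Stand-in constructor so the sketch runs without any ML library installed.
def fake_constructor(**kwargs):
    return dict(kwargs)

model = build_model("CatBoost", {"CatBoost": fake_constructor})
```

In practice the mapping would point at the real estimator classes, for example `sklearn.ensemble.RandomForestClassifier`, `xgboost.XGBClassifier`, `catboost.CatBoostClassifier`, and `lightgbm.LGBMClassifier`, all of which accept these keyword arguments.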

Table 5. Performance comparison of all models under the 90:10 train–test split; best results per metric (Accuracy, Precision, Recall, F1, AUC) are in bold.

Feature Selection	Model	Accuracy	Recall	Precision	F1 Score	AUC
Mutual Information
	CatBoost	0.9749	0.904834	0.690462	0.868525	0.842998
	LightGBM	0.9562	0.601817	0.758961	0.774668	0.689962
	RNN + LSTM	0.8627	0.907982	0.842346	0.621196	0.968291
	RandomForest	0.987	0.697308	0.718856	0.799906	0.977199
	XGBoost	0.9719	0.824132	0.880177	0.705922	0.771469
RFE
	CatBoost	0.9877	0.795813	0.662883	0.760288	0.861263
	LightGBM	0.9716	0.677038	0.94058	0.734333	0.970451
	RNN + LSTM	0.9626	0.836977	0.768763	0.644877	0.617252
	RandomForest	0.9909	0.645635	0.85642	0.761164	0.7942
	XGBoost	0.9866	0.620665	0.627675	0.701849	0.798134
Relief
	CatBoost	0.9757	0.833417	0.986704	0.712684	0.922498
	LightGBM	0.9552	0.928791	0.875948	0.673784	0.902367
	RNN + LSTM	0.944	0.603453	0.781352	0.9165	0.800418
	RandomForest	0.9845	0.748153	0.778282	0.868606	0.848217
	XGBoost	0.9731	0.987458	0.671557	0.997086	0.830719
ReliefF
	CatBoost	0.9758	0.701262	0.694178	0.86214	0.820574
	LightGBM	0.9547	0.951275	0.899849	0.704672	0.675592
	RNN + LSTM	0.9467	0.880528	0.908273	0.742597	0.67446
	RandomForest	0.9827	0.804954	0.681128	0.807457	0.705805
	XGBoost	0.9728	0.88598	0.705635	0.758421	0.769526
Correlation-based
	CatBoost	0.6232	0.785464	0.780027	0.782738	0.609248
	LightGBM	0.6049	0.969095	0.910732	0.770365	0.668147
	RNN + LSTM	0.5917	0.992045	0.983173	0.965933	0.64258
	RandomForest	0.6863	0.691334	0.610652	0.949527	0.936957
	XGBoost	0.6209	0.914792	0.608008	0.80327	0.766213
Model-based
	CatBoost	0.9888	0.962328	0.777761	0.633384	0.897283
	LightGBM	0.9699	0.743395	0.996742	0.788375	0.655742
	RNN + LSTM	0.9635	0.610133	0.982966	0.766718	0.706197
	RandomForest	0.9914	0.931887	0.847256	0.806455	0.683052
	XGBoost	0.9871	0.846523	0.838083	0.638342	0.770376

Table 6. Performance comparison of all models under the 80:20 train–test split.

Feature Selection Method	Model	Accuracy	Recall	Precision	F1 Score	AUC
Mutual Information
	CatBoost	0.9739	0.798923	0.619009	0.805752	0.964411
	LightGBM	0.9529	0.975961	0.863294	0.742278	0.699431
	RNN + LSTM	0.8497	0.819235	0.80698	0.759353	0.956144
	RandomForest	0.9861	0.711722	0.911739	0.76679	0.825793
	XGBoost	0.9716	0.870813	0.947071	0.884573	0.862082
RFE
	CatBoost	0.9857	0.649877	0.704808	0.8979	0.689285
	LightGBM	0.9702	0.805591	0.799191	0.643816	0.64842
	RNN + LSTM	0.9640	0.793378	0.901545	0.941477	0.632972
	RandomForest	0.9897	0.695229	0.811982	0.978512	0.924495
	XGBoost	0.9857	0.737249	0.85378	0.828005	0.639923
Relief
	CatBoost	0.9730	0.885159	0.95307	0.770535	0.979375
	LightGBM	0.9525	0.766013	0.820477	0.662833	0.889025
	RNN + LSTM	0.9375	0.7	0.861508	0.65715	0.803496
	RandomForest	0.9810	0.640606	0.874684	0.609023	0.664299
	XGBoost	0.9713	0.654724	0.760414	0.827304	0.964871
ReliefF
	CatBoost	0.9731	0.831297	0.841859	0.934922	0.846046
	LightGBM	0.9537	0.723748	0.968789	0.886612	0.603536
	RNN + LSTM	0.9459	0.797748	0.795881	0.719749	0.856571
	RandomForest	0.9787	0.76864	0.687718	0.767006	0.764384
	XGBoost	0.9708	0.831816	0.892801	0.958404	0.722814
Correlation-based
	CatBoost	0.6138	0.915354	0.685669	0.807834	0.66691
	LightGBM	0.6004	0.637558	0.858328	0.736278	0.957402
	RNN + LSTM	0.5910	0.881865	0.883029	0.774056	0.964639
	RandomForest	0.6772	0.874622	0.765682	0.601329	0.804171
	XGBoost	0.6096	0.685832	0.756527	0.853353	0.903777
Model-based
	CatBoost	0.9866	0.963091	0.632288	0.786762	0.889891
	LightGBM	0.9694	0.933881	0.708122	0.806424	0.726937
	RNN + LSTM	0.9624	0.882569	0.775223	0.66675	0.616749
	RandomForest	0.9897	0.814609	0.755561	0.932214	0.89216
	XGBoost	0.9849	0.914273	0.647892	0.850143	0.604667

Table 7. Performance comparison of all models under the 70:30 train–test split.

Feature Selection Method	Model	Accuracy	Recall	Precision	F1 Score	AUC
Mutual Information
	CatBoost	0.9731	0.601151	0.912792	0.691087	0.85596
	LightGBM	0.9558	0.704806	0.923149	0.926469	0.951541
	RNN + LSTM	0.8501	0.764621	0.702396	0.71165	0.82491
	RandomForest	0.9851	0.760119	0.899627	0.740562	0.889904
	XGBoost	0.9722	0.88979	0.796421	0.754351	0.948969
RFE
	CatBoost	0.9864	0.855432	0.625921	0.697662	0.765965
	LightGBM	0.9713	0.656193	0.851219	0.970671	0.852759
	RNN + LSTM	0.9626	0.919289	0.709009	0.635736	0.942664
	RandomForest	0.9894	0.666482	0.993943	0.809336	0.714497
	XGBoost	0.9861	0.835971	0.807732	0.772012	0.826582
Relief
	CatBoost	0.9717	0.954848	0.836371	0.648531	0.84763
	LightGBM	0.9509	0.984228	0.674332	0.88741	0.802683
	RNN + LSTM	0.9384	0.664084	0.874892	0.979518	0.870642
	RandomForest	0.9802	0.816757	0.684596	0.622357	0.859792
	XGBoost	0.9707	0.992105	0.675924	0.682691	0.740771
ReliefF
	CatBoost	0.9725	0.924294	0.860407	0.668035	0.663806
	LightGBM	0.9507	0.990708	0.614532	0.8353	0.662955
	RNN + LSTM	0.9385	0.710726	0.634973	0.824327	0.996142
	RandomForest	0.9785	0.647736	0.678948	0.817843	0.649228
	XGBoost	0.9706	0.773456	0.698027	0.792471	0.921085
Correlation-based
	CatBoost	0.6126	0.949141	0.693303	0.785164	0.710548
	LightGBM	0.5995	0.932876	0.752983	0.730375	0.659205
	RNN + LSTM	0.5889	0.887946	0.907728	0.914135	0.633926
	RandomForest	0.6688	0.732814	0.937343	0.814342	0.681625
	XGBoost	0.6089	0.742217	0.652592	0.797131	0.868122
Model-based
	CatBoost	0.9861	0.92615	0.858515	0.789538	0.726312
	LightGBM	0.9706	0.767083	0.991497	0.807526	0.923869
	RNN + LSTM	0.9622	0.649916	0.855903	0.810854	0.797215
	RandomForest	0.9887	0.721168	0.977737	0.984446	0.684643
	XGBoost	0.9856	0.645382	0.905245	0.89354	0.951307

Table 8. Ten-Fold Cross-Validation Results.

Feature Selection Method	Model	Accuracy	Recall	Precision	F1 Score	ROC AUC
Correlation	RandomForest	0.983265	0.97942	0.982101	0.953074	0.968299
Correlation	XGBoost	0.95423	0.964737	0.95511	0.949881	0.963224
Correlation	LightGBM	0.952397	0.943204	0.946739	0.936505	0.961408
Correlation	CatBoost	0.966788	0.965268	0.956541	0.977141	0.979601
Correlation	RNN + LSTM	0.956173	0.960005	0.957684	0.962904	0.961174
Model-Based	RandomForest	0.958425	0.972705	0.987196	0.965316	0.973554
Model-Based	XGBoost	0.949914	0.937312	0.947752	0.942976	0.962613
Model-Based	LightGBM	0.964294	0.943742	0.942688	0.966252	0.933951
Model-Based	CatBoost	0.951081	0.988495	0.97293	0.9792	0.95585
Model-Based	RNN + LSTM	0.938502	0.946088	0.967277	0.950001	0.951065
Relief	RandomForest	0.952675	0.958656	0.964963	0.977401	0.982945
Relief	XGBoost	0.958123	0.953603	0.957261	0.959296	0.936463
Relief	LightGBM	0.953043	0.935569	0.966182	0.937608	0.939073
Relief	CatBoost	0.966951	0.95236	0.971278	0.991197	0.954475
Relief	RNN + LSTM	0.960016	0.959566	0.929303	0.961941	0.964307
ReliefF	RandomForest	0.965352	0.97867	0.958083	0.972737	0.967559
ReliefF	XGBoost	0.946665	0.969082	0.943658	0.957563	0.943444
ReliefF	LightGBM	0.945846	0.958076	0.965361	0.959538	0.959496
ReliefF	CatBoost	0.965762	0.950695	0.971574	0.954171	0.979059
ReliefF	RNN + LSTM	0.950435	0.942902	0.938249	0.950994	0.928936
RFE	RandomForest	0.958625	0.976047	0.966811	0.981644	0.975979
RFE	XGBoost	0.94771	0.953999	0.94818	0.942667	0.940498
RFE	LightGBM	0.935693	0.946494	0.936696	0.961345	0.970275
RFE	CatBoost	0.981377	0.953284	0.964033	0.989793	0.954406
RFE	RNN + LSTM	0.931446	0.955465	0.937474	0.956263	0.937656
Mutual Information	RandomForest	0.974076	0.981811	0.977506	0.970682	0.972957
Mutual Information	XGBoost	0.95654	0.941941	0.962239	0.961731	0.966264
Mutual Information	LightGBM	0.964519	0.954746	0.941361	0.961991	0.960191
Mutual Information	CatBoost	0.963608	0.965012	0.968076	0.98322	0.965882
Mutual Information	RNN + LSTM	0.93827	0.964822	0.926848	0.945622	0.938091

Table 9. Top-10 Features.

Rank	Feature Name	Description	Importance Sources
1	swvl1	Surface soil moisture (layer 1)	RF, MI, XGBoost, ReliefF
2	mn2t	Minimum 2 m air temperature	RF, MI, XGBoost, Correlation
3	lgws	Large-scale wind speed	MI, ReliefF, XGBoost
4	pev	Potential evapotranspiration	RF, MI, XGBoost
5	DOY	Day of year (seasonality indicator)	Relief, MI, XGBoost
6	gwd	Wind direction	RF, XGBoost
7	blh	Boundary layer height	MI, XGBoost
8	mgws	Medium-scale wind speed	Relief, XGBoost
9	vilwd	Divergence of wind	RF, XGBoost
10	swvl2	Soil moisture (layer 2)	MI, RF
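
One simple way to read the "Importance Sources" column of Table 9 is as a vote count: the more selection methods flag a feature, the higher it ranks. The sketch below applies that idea to the column as printed. Vote counting is an assumption here, since this section does not state how the ranks were derived, and tie-breaking by table order is likewise only a convention of the sketch.

```python
from collections import Counter

# "Importance Sources" column of Table 9, transcribed in table order.
sources = {
    "swvl1": ["RF", "MI", "XGBoost", "ReliefF"],
    "mn2t": ["RF", "MI", "XGBoost", "Correlation"],
    "lgws": ["MI", "ReliefF", "XGBoost"],
    "pev": ["RF", "MI", "XGBoost"],
    "DOY": ["Relief", "MI", "XGBoost"],
    "gwd": ["RF", "XGBoost"],
    "blh": ["MI", "XGBoost"],
    "mgws": ["Relief", "XGBoost"],
    "vilwd": ["RF", "XGBoost"],
    "swvl2": ["MI", "RF"],
}

# Number of methods that selected each feature.
votes = {feature: len(methods) for feature, methods in sources.items()}

# Rank by descending vote count; Python's sort is stable, so features
# with equal votes keep their order from the table.
ranked = sorted(sources, key=lambda feature: -votes[feature])

# How often each selection method appears among the top-10 features.
method_counts = Counter(m for methods in sources.values() for m in methods)

for rank, feature in enumerate(ranked, start=1):
    print(rank, feature, votes[feature])
print(method_counts.most_common())
```

With the column as printed, the vote-count ordering matches the ranks in Table 9: two features have four sources, three have three, and the remaining five have two.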

Table 10. Selected Features.

Feature Code	Full Name	Description
istl2	Instantaneous Soil Temperature Level 2	Soil temperature at a specific depth in the ground.
index_instant	Instantaneous Fire Index	Fire index based on real-time weather data.
blh	Boundary Layer Height	Height of the atmospheric boundary layer, affecting fire behavior.
vilwd	Vertically Integrated Liquid Water Density	Measures moisture content in the atmosphere.
fal	Fractional Land Cover	Percentage of land covered by vegetation, crucial for fuel assessment.
vilwe	Vertically Integrated Liquid Water Equivalent	Measures water content in clouds, affecting precipitation.
viiwe	Vertically Integrated Ice Water Equivalent	Indicator of ice particles in the atmosphere, relevant for cloud and precipitation dynamics.
es	Evaporation Stress Index	Represents water stress on vegetation, affecting fire spread.
gwd	Geostrophic Wind Direction	Wind flow pattern at high altitudes, influencing fire movement.
lgws	Large-Scale Wind Speed	Measures strong winds that impact fire intensity.
nsss	North-South Surface Stress	Measures frictional force of wind along the north-south axis.
DOY	Day of Year	Temporal feature used to track seasonal fire patterns.
ttr	Total Top Radiation	Solar radiation received at the top of the atmosphere, influencing fire ignition.
ttrc	Total Cloud Cover Radiation	Measures cloud influence on radiation, affecting temperature and humidity.
deg0l	Zero-Degree Level Height	Altitude where temperature is 0 °C, affecting precipitation type (rain vs. snow).
mn2t	Minimum 2 m Air Temperature	Lowest daily temperature at 2 m above ground.
pev	Potential Evapotranspiration	Amount of water evaporated and transpired, affecting soil dryness.
flsr	Fraction of Land Surface Reflectance	Indicator of vegetation health and surface dryness.
asn	Accumulated Snow	Total snow accumulation, influencing moisture levels.
vilwn	Vertically Integrated Liquid Water Northward	Measures northward movement of moisture in the atmosphere.
viiwd	Vertically Integrated Ice Water Downward	Measures downward motion of ice water, affecting precipitation.
ie	Instantaneous Evaporation	Measures real-time evaporation, affecting soil moisture.
viiwn	Vertically Integrated Ice Water Northward	Tracks northward movement of ice particles in the atmosphere.
index	General Fire Index	A calculated index representing overall fire risk.
index_max	Maximum Fire Index	The highest fire index value observed in a given period.
stl2	Soil Temperature Level 2	Temperature of soil at a deeper layer than istl2.
bld	Boundary Layer Depth	Measures the depth of the boundary layer, affecting heat and moisture exchange.
mgws	Mean Gust Wind Speed	Measures peak wind gusts, influencing rapid fire spread.
swvl1	Soil Water Volume Level 1	Measures water content in the topsoil layer, affecting vegetation dryness.
chnk	Convective Heating Near Surface	Tracks heat transfer near the ground, influencing local fire conditions.

Table 11. Comparative analysis.

Split	Best for Accuracy	Best for Recall	Best for Precision	Best for ROC AUC
90:10	0.9914 (Model-Based + RF)	0.9920 (Correlation + RNN + LSTM)	0.9967 (Model-Based + LightGBM)	0.9644 (Mutual Information + CatBoost)
80:20	0.9897 (RFE + RF)	0.9760 (Mutual Information + LightGBM)	0.9967 (Model-Based + LightGBM)	0.9794 (Relief + CatBoost)
70:30	0.9894 (RFE + RF)	0.9921 (Relief + XGBoost)	0.9939 (RFE + RF)	0.9961 (ReliefF + RNN + LSTM)
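
The three ratios in Table 11 are train:test proportions. As a rough sketch of how such partitions can be produced, the helper below draws a class-stratified random split for each ratio. Whether the authors stratified, shuffled, or split by time is not stated in this section, so the stratification, the fixed seed, the hypothetical helper name `stratified_split`, and the toy 20% fire rate are all assumptions.

```python
import random


def stratified_split(labels, train_frac, seed=0):
    """Split sample indices into train/test, keeping class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_frac)  # per-class training size
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical): 200 fire records, 800 non-fire records.
labels = [1] * 200 + [0] * 800

splits = {}
for name, frac in (("90:10", 0.9), ("80:20", 0.8), ("70:30", 0.7)):
    train, test = stratified_split(labels, frac)
    splits[name] = (train, test)
    fire_share = sum(labels[i] for i in test) / len(test)
    print(name, len(train), len(test), round(fire_share, 3))
```

Stratifying keeps the fire/non-fire ratio identical in both partitions, which makes recall and precision comparable across the three ratios. For spatiotemporal records, a time-ordered split may be the more conservative choice because it avoids training on data recorded after the test period.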

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

