A Forest Fire Risk Prediction Framework Based on Machine Learning Models in the Greater Khingan

Li, Heng; Zhang, Jialong; Yang, Jingwen; Teng, Chenkai; Luo, Kai; Sun, Kaiping

doi:10.3390/fire9060256

by Heng Li ^1,2, Jialong Zhang ^1,2,*, Jingwen Yang ^1,2, Chenkai Teng ^1,2, Kai Luo ^1,2 and Kaiping Sun ^3

^1 The Key Laboratory of Forest Resources Conservation and Utilization in the Southwest Mountains of China Ministry of Education, Southwest Forestry University, Kunming 650224, China

^2 College of Forestry (College of Asia-Pacific Forestry), Southwest Forestry University, Kunming 650224, China

^3 College of Soil and Water Conservation, Southwest Forestry University, Kunming 650224, China

^* Author to whom correspondence should be addressed.

Fire 2026, 9(6), 256; https://doi.org/10.3390/fire9060256 (registering DOI)

Submission received: 28 April 2026 / Revised: 11 June 2026 / Accepted: 12 June 2026 / Published: 15 June 2026

(This article belongs to the Special Issue Machine Learning (ML) and Deep Learning (DL) Applications in Wildfire Science: Principles, Progress and Prospects (2nd Edition))

Abstract

The Greater Khingan, a key cold-temperate coniferous forest region in northern China, is frequently affected by forest fires with severe ecological and economic impacts. The study investigates the influence of key environmental and anthropogenic drivers on forest fire susceptibility and evaluates multiple machine-learning approaches for regional fire assessment. Using 2001–2018 fire point data and multi-source remote sensing data, we integrated 13 driving factors across four dimensions: meteorology, topography, vegetation, and human activities. Collinear variables were screened using the Variance Inflation Factor (VIF). Three machine learning models—Logistic Regression (LR), Random Forest (RF), and Support Vector Machine (SVM)—were constructed to assess the long-term potential risk of forest fire occurrence. Driving mechanisms were analyzed using standardized regression coefficients and the SHapley Additive exPlanations (SHAP) interpretable algorithm, and spatial distribution maps of regional forest fire risk were generated based on the optimal model. Among the three models, RF achieved the highest predictive accuracy, with an accuracy of 0.919 and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.966, significantly outperforming LR and SVM. SHAP analysis reveals that forest fires are primarily driven by climatic factors (Pres and Prec as core drivers), regulated by topographic factors, and weakly affected by human factors. The proposed framework provides an effective tool for long-term forest fire susceptibility assessment by combining robust predictive performance with interpretable model outputs. The findings provide scientific support for long-term strategic forest fire risk zoning, regional firefighting resource allocation, and the formulation of differentiated prevention and control strategies, and also offer methodological references for forest fire prediction in other cold-temperate forest regions in China.

Keywords: risk assessment model; forest fire risk; driving factors; SHapley Additive exPlanations (SHAP); random forest algorithm

1. Introduction

Wildfire is a major disaster that disrupts the integrity of forest ecosystems. Consequently, effective wildfire management is crucial for sustaining ecosystem functions in Northeast China and protecting strategically important forest resources. Driven by the dual impacts of global climate change and increased economic activities in forested areas, forest fires in China’s Greater Khingan have exhibited the evolutionary characteristics of “concentrated high-incidence periods, rapid spread, and high difficulty in extinguishing”. The region experiences an average of 65 fires per year, with the burned forest area accounting for 0.8% of the total regional forest area [1]. These fires not only cause severe casualties and economic losses but also pose a fatal threat to the cold-temperate coniferous forest ecosystem and rare wild species [2,3]. Studies have shown that the intensity of forest fire carbon emissions in the Greater Khingan increased by 1.5 times between 1990 and 2020, with an average annual new carbon emission of 720 Tg, making it an important local driver of climate change in Northeast Asia [4]. Additionally, climate anomalies leading to prolonged spring droughts and increased summer thunderstorms have further exacerbated the suddenness and destructiveness of forest fires, significantly increasing the difficulty and cost of firefighting and prevention [5]. Therefore, constructing a high-precision, long-term forest fire risk assessment system in the Greater Khingan has become an urgent task for the ecological management of forestry in Northeast China.

The standard definition of forest fire risk, proposed by San-Miguel-Ayanz [6], integrates two interdependent elements: the probability of fire occurrence and the severity of its associated consequences, with increases in either component leading to higher overall risk. The present study is dedicated to forest fire susceptibility mapping in the Greater Khingan, which forms part of the future forest fire risk assessment system. Long-term fire potential mapping is a core tool for evidence-based fire management [7], supporting both pre-season firefighting resource optimization and long-term forest planning, where targeted silvicultural measures can be applied to enhance forest resilience to fire disturbances.

Current mainstream forest fire risk prediction models are mostly built based on short-time series data. These models not only over-rely on meteorological data in the selection of prediction factors [8,9], ignoring key driving factors such as human activities, fuel types, and topographic heterogeneity, but also mostly cover large-scale areas, making it difficult to accurately capture the fire risk gradient differences within the Greater Khingan. As a result, the prediction accuracy of these models in core forest areas is generally lower than 72%, failing to meet the needs of refined prevention and control [10,11]. The aforementioned deficiencies make traditional long-term risk zoning methods unable to adapt to the complex forest fire formation mechanisms in the Greater Khingan, highlighting an urgent need for optimization in both research scale and factor system.

Machine learning algorithms, with their strong fitting ability for multi-source heterogeneous data, provide an effective path to address the accuracy bottleneck in forest fire prediction. Recent international studies have increasingly applied machine learning approaches to wildfire susceptibility and occurrence prediction. For example, Rodrigues and de la Riva [12] demonstrated the effectiveness of machine learning algorithms in Mediterranean fire-prone regions, while Jain et al. [13] reported that ensemble and nonlinear machine learning models generally outperform traditional statistical approaches in large-scale wildfire prediction. LR, RF, and SVM were selected in this study because they represent three widely used machine-learning paradigms and have been successfully applied in wildfire susceptibility assessment [14,15,16]. Based on this, this study takes the Greater Khingan as the research region, adopts long-term datasets from 2001 to 2018 and integrates climatic, terrain-related, and anthropogenic variables to establish a comprehensive forest fire susceptibility assessment framework. By comparing the performance of RF, LR, and SVM models, the study examines the relationships between fire occurrence and its major environmental and anthropogenic controls. Given that the primary objective of this study is to assess long-term forest fire susceptibility rather than annual fire occurrence dynamics, annual-average environmental variables were used to characterize stable environmental backgrounds across the study period. This design allows the model to identify persistent spatial patterns of fire-prone conditions that are relevant to long-term forest management and prevention planning. 
This study seeks to enhance regional forest fire susceptibility assessment by incorporating a more comprehensive set of environmental and anthropogenic variables and extending the analysis to the scale of the Greater Khingan, and to provide technical support for precise fire prevention and control in the forest regions of Northeast China.

2. Materials and Methods

2.1. Study Area

As depicted in Figure 1, our research targets the Greater Khingan Mountains of Heilongjiang Province, geographically bounded by longitudes from 121°12′ to 127°00′ E and latitudes from 50°10′ to 53°33′ N. This territory serves as China’s primary natural distribution zone for boreal cold-temperate coniferous woodland. Dominated by a cold-temperate continental monsoon climate, the region is characterized by short, warm, and humid summers alongside prolonged, frigid, and arid winters. Its annual mean temperature stays below −2 °C, while annual total precipitation fluctuates within the 300–500 mm range. Elevation across the study domain varies from 137 m to 1434 m [17]. The study area is characterized by undulating topography dominated by low-to-medium-altitude hills, with coniferous and temperate broadleaved forest cover constituting approximately 73.3% of the total land coverage [18]. It is also among China’s most fire-prone forest zones.

2.2. Materials

2.2.1. Historical Fire Point Datasets

Historical fire occurrence data were obtained from the Fire Information for Resource Management System (FIRMS) based on the MODIS Collection 6.1 Active Fire Product (NASA FIRMS), covering the period from 2001 to 2018 with a spatial resolution of 1 km. The database contains information on fire location, confidence level, fire type, and fire radiative power. To improve data reliability, only vegetation fire records with a confidence level greater than 80% were retained. Furthermore, annual land-use data were used to exclude non-forest fire events, and only fire points located within forested areas of the Greater Khingan Mountains were preserved. A total of 4609 forest fire points were ultimately obtained and labeled as fire samples (value = 1).

To construct non-fire samples, an equal number of randomly generated points (4609) were created within forested areas of the study region and labeled as non-fire samples (value = 0), resulting in a balanced dataset with a fire-to-non-fire ratio of 1:1. Non-fire observations were sampled from the same period as the fire inventory (2001–2018), so that both classes share the same temporal coverage. In addition, a minimum buffer distance of 500 m from known fire points was imposed during sample generation to reduce spatial overlap and potential spatial autocorrelation. This threshold was selected based on previous forest fire susceptibility studies and represents a compromise between sample independence and sample availability [19].
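The buffer constraint on non-fire points can be illustrated with simple rejection sampling. The sketch below is not the authors' procedure: it assumes projected coordinates in metres, uses a rectangular extent as a stand-in for the forest mask, and the function and variable names are hypothetical.

```python
import math
import random

def sample_non_fire_points(fire_points, n_points, extent, min_dist=500.0, seed=0):
    """Draw random points at least `min_dist` metres from every fire point.

    fire_points: list of (x, y) in a projected CRS with metre units (assumption).
    extent: (xmin, ymin, xmax, ymax) box standing in for the forest mask (assumption).
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = extent
    accepted = []
    max_tries = 1000 * n_points  # guard against an impossible request
    tries = 0
    while len(accepted) < n_points and tries < max_tries:
        tries += 1
        x = rng.uniform(xmin, xmax)
        y = rng.uniform(ymin, ymax)
        # Reject candidates that fall inside the buffer of any fire point.
        if all(math.hypot(x - fx, y - fy) >= min_dist for fx, fy in fire_points):
            accepted.append((x, y))
    return accepted

# Toy example: three fire points in a 10 km x 10 km window.
fires = [(2000.0, 2000.0), (5000.0, 5000.0), (8000.0, 3000.0)]
non_fires = sample_non_fire_points(fires, n_points=3, extent=(0, 0, 10000, 10000))
```

In a real workflow the acceptance test would also check that each candidate falls on a forest pixel, and a spatial index would replace the brute-force distance loop for thousands of points.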

The final dataset consisted of 9218 samples, including 4609 fire samples and 4609 non-fire samples. For model development and evaluation, all samples were randomly divided into a training dataset (70%, n = 6457) and an independent test dataset (30%, n = 2761). The spatial distribution of the fire points is shown in Figure 1.

2.2.2. Drivers of Forest Fires

In this study, 13 key factors were selected as driving variables for forest fire occurrence from four dimensions, including topography, vegetation cover, meteorological conditions, and human activities.

Topographic variables, including elevation, slope, and aspect, were derived from the NASA Shuttle Radar Topography Mission (SRTMGL30) Digital Elevation Model (DEM) dataset, which has an original spatial resolution of 30 m. The DEM data were preprocessed through mosaicking, projection transformation, clipping, and terrain analysis to generate the corresponding topographic factors.

Vegetation information was represented by the Normalized Difference Vegetation Index (NDVI), derived from the MODIS Vegetation Indices product (MOD13A3 Version 6.1; Dataset ID: MOD13A3.061) provided by the NASA EOSDIS Land Processes Distributed Active Archive Center (LPDAAC). Monthly NDVI data from 2001 to 2018 were processed using the MODIS Reprojection Tool (MRT). Batch mosaicking, reprojection, and format conversion were subsequently implemented using Python 3.11.3 scripts, and annual mean NDVI values were calculated for each year.

Meteorological variables, including annual precipitation (Prec), annual mean temperature (Tmp), annual mean wind speed (Wind), and annual mean relative humidity (RH), were obtained from the China Meteorological Forcing Dataset (China Met, 2025 Release), covering the period from 2001 to 2018. Annual average raster layers were generated using raster calculation procedures.

To ensure consistency among heterogeneous datasets, all raster and vector data were projected to a common coordinate reference system (GCS_WGS_1984; EPSG:4326) and clipped to the boundary of the Greater Khingan study area. Considering that the fire occurrence dataset has a spatial resolution of 1 km, all driving factors were resampled to a uniform spatial resolution of 1 km using the Resample tool in ArcGIS 10.8. Bilinear interpolation was applied to continuous variables (e.g., DEM, Tmp, Prec, Wind, RH, and NDVI), whereas nearest-neighbor interpolation was used for categorical variables to preserve their original class information.

Roads, residential areas, and water systems were obtained from the National Fundamental Geographic Information Database. Distance to road and distance to residential areas were calculated using the nearest neighbor analysis method in ArcGIS. Table 1 presents the complete set of variables used in model development, including the independent variables and the dependent variable, along with their spatial resolution, unit, temporal coverage, and corresponding data sources.

2.3. Methods

This study takes the Greater Khingan in Heilongjiang as the study area. The technical framework for fire prediction is illustrated in Figure 2, which follows the workflow of data preprocessing, feature selection, model construction, and fire risk zoning. Multi-source data including meteorology, vegetation, topography, human activities, and historical forest fire records were collected and preprocessed. Key influencing factors were identified using the Variance Inflation Factor (VIF) and Recursive Feature Elimination (RFE). Machine learning algorithms including LR, RF, and SVM were adopted to establish forest fire prediction models. All models were developed and implemented in the Python programming environment. The optimal model was determined after validation and optimization using training and validation datasets. Furthermore, the natural breaks method was applied, combined with the actual occurrence of forest fires [20], to classify the forest fire risk in the study area into four levels: low, moderate, relatively high, and high. The rationality of the risk classification was verified using historical fire data, providing a scientific basis for the precise prevention and control of forest fires.
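The natural breaks step can be sketched on a small sample. The code below is a brute-force, Jenks-style illustration that picks the class boundaries minimizing the within-class sum of squared deviations; it is only practical for small inputs and is not the GIS implementation used in the study. The probability values and function names are hypothetical; only the four level names come from the text.

```python
from itertools import combinations

def natural_breaks(values, n_classes):
    """Brute-force Jenks-style breaks for a small sample.

    Tries every way to cut the sorted values into `n_classes` contiguous
    groups and keeps the cut with the smallest within-class squared deviation.
    """
    data = sorted(values)
    n = len(data)

    def ssd(chunk):
        mean = sum(chunk) / len(chunk)
        return sum((v - mean) ** 2 for v in chunk)

    best_cost, best_cuts = None, None
    for cuts in combinations(range(1, n), n_classes - 1):
        bounds = (0,) + cuts + (n,)
        cost = sum(ssd(data[bounds[i]:bounds[i + 1]]) for i in range(n_classes))
        if best_cost is None or cost < best_cost:
            best_cost, best_cuts = cost, cuts
    # The break for each class (except the last) is its largest member.
    return [data[c - 1] for c in best_cuts]

def risk_level(p, breaks, labels=("low", "moderate", "relatively high", "high")):
    """Map a predicted probability to one of the four risk levels."""
    for upper, label in zip(breaks, labels):
        if p <= upper:
            return label
    return labels[-1]

# Hypothetical predicted fire probabilities (illustrative numbers only).
probs = [0.02, 0.05, 0.07, 0.30, 0.33, 0.36, 0.60, 0.62, 0.65, 0.90, 0.93, 0.95]
breaks = natural_breaks(probs, n_classes=4)
```

For a full raster, an optimized dynamic-programming implementation would be needed, since the number of candidate cuts grows combinatorially with sample size.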

2.3.1. Feature Selection Methods

Variance Inflation Factor

Multicollinearity is a common interfering factor in multivariate modeling. It can produce unstable parameter estimates with inflated variance and invalidate significance tests, thereby reducing prediction accuracy. Therefore, it is necessary to conduct collinearity tests on the independent variables used in modeling. In this study, the VIF was selected as the test indicator to perform collinearity diagnosis on all independent variables involved in LR modeling [15]. Following the criteria of relevant studies, VIF > 10 indicates significant multicollinearity between the corresponding independent variable and the other variables, which requires targeted handling [21].
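As a hedged illustration of the VIF > 10 rule, the sketch below computes VIF_j = 1 / (1 − R_j²) by regressing each predictor on the others with ordinary least squares. It uses NumPy and synthetic data; the function name and the example variables are hypothetical and are not the study's predictors.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r2))
    return vifs

# Synthetic predictors: `c` is nearly a copy of `a`, so both are collinear.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
flagged = [v > 10 for v in vifs]  # True marks a variable needing attention
```

Here the near-duplicate pair is flagged while the independent predictor stays close to a VIF of 1, which is the behaviour the threshold is meant to capture.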

Recursive Feature Elimination

RFE optimizes the predictor set through a stepwise elimination procedure, retaining variables with relatively greater contributions to model performance. Its operating logic is as follows: relying on the target model, all candidate features are included for training in the first round, and the variable with the lowest contribution is screened out and eliminated through feature importance evaluation; then, the above training elimination process is repeated based on the remaining features, and the iteration is continued until the preset number of features is met, so as to achieve accurate screening of core features [22]. For the input driving factors of the RF model, this study adopted the RFE method for feature optimization to support model construction and subsequent prediction analysis.
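The elimination loop described above can be sketched as follows. This is a simplified illustration: it ranks features by the absolute value of standardized least-squares coefficients, which is a stand-in for the importance scores of the target model used in the study. The data, feature names, and function name are hypothetical.

```python
import numpy as np

def rfe(X, y, names, n_keep):
    """Recursive feature elimination down to `n_keep` features.

    Importance here is |standardized least-squares coefficient| (an assumption
    made for this sketch; the source relies on the target model's importance).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > n_keep:
        # Refit on the surviving features in every round.
        A = np.column_stack([np.ones(len(y)), Xs[:, remaining]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        importance = np.abs(coef[1:])  # skip the intercept
        worst = int(np.argmin(importance))
        eliminated.append(names[remaining.pop(worst)])
    return [names[i] for i in remaining], eliminated

# Synthetic example: only f0 and f2 actually drive the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=400)
kept, dropped = rfe(X, y, ["f0", "f1", "f2", "f3"], n_keep=2)
```

Refitting after every removal matters because the importance of the remaining features can change once a correlated feature has been dropped.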

2.3.2. Model Construction

Logistic Regression Model

LR is a widely used binary classification approach and a common forest fire risk prediction model both in China and internationally [23]. In this study, let P be the probability of forest fire occurrence in a certain grid in the Greater Khingan under specific environmental and anthropogenic conditions, and (1 − P) be the probability of non-occurrence. LR quantifies this probability by constructing a logistic expression related to fire driving factors. The core of this expression is the sigmoid function, which maps any real value z to the interval (0, 1), ensuring that P is always a valid probability. The specific formulas are as follows:

P = \frac{e^{z}}{e^{z} + 1} = \frac{1}{1 + e^{-z}} \qquad (1)

1 - P = \frac{1}{1 + e^{z}} \qquad (2)

where P is the probability of forest fire occurrence under the given conditions, e is the base of the natural logarithm, and z is the linear predictor, i.e., the weighted combination of the fire driving factors estimated by the model. As z approaches positive infinity, P approaches 1, indicating a very high likelihood of wildfire ignition; conversely, as z approaches negative infinity, P approaches 0, indicating that the probability of fire occurrence is extremely low.
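A short numerical check of Equations (1) and (2) may help. The sketch below evaluates the sigmoid for a few values of z; the function name is hypothetical, and the branch on the sign of z is only there to avoid floating-point overflow.

```python
import math

def fire_probability(z):
    """Equation (1): map a real-valued z to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # for negative z, this form avoids overflow in exp(-z)
    return ez / (1.0 + ez)

p_mid = fire_probability(0.0)    # z = 0 gives equal odds of fire and non-fire
p_high = fire_probability(4.0)   # large positive z pushes P toward 1
p_low = fire_probability(-4.0)   # large negative z pushes P toward 0
```

By Equation (2), the probabilities at z and −z sum to one, so p_high and p_low are complementary.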

To construct the LR model, fire points were assigned a value of 1 and an equal number of randomly generated non-fire points were assigned a value of 0, resulting in a balanced dataset. LR was implemented using the Logit function in the Statsmodels package (version 0.14.1), and model coefficients were estimated by maximum likelihood estimation (MLE) without regularization. To reduce the influence of random sample partitioning on variable selection, the modeling dataset was randomly divided into training and validation subsets (70%/30%) five times. Variables that remained statistically significant (p < 0.05) in at least three of the five runs were retained and subsequently used to construct the final LR model. The final model was evaluated using the same independent testing dataset adopted for model comparison.
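The "significant in at least three of five runs" rule amounts to a simple vote over repeated fits. The sketch below shows the counting step only; the p-values and variable names are invented for illustration and do not come from the study.

```python
from collections import Counter

def stable_variables(pvalue_runs, alpha=0.05, min_runs=3):
    """Keep variables with p < alpha in at least `min_runs` of the repeated fits."""
    counts = Counter()
    for run in pvalue_runs:
        for name, p in run.items():
            if p < alpha:
                counts[name] += 1
    return sorted(name for name, c in counts.items() if c >= min_runs)

# Hypothetical p-values from five random 70/30 splits (illustrative numbers only).
runs = [
    {"x1": 0.001, "x2": 0.20, "x3": 0.04},
    {"x1": 0.003, "x2": 0.03, "x3": 0.06},
    {"x1": 0.010, "x2": 0.40, "x3": 0.01},
    {"x1": 0.002, "x2": 0.08, "x3": 0.03},
    {"x1": 0.020, "x2": 0.60, "x3": 0.09},
]
selected = stable_variables(runs)
```

In this toy case x1 is significant in all five runs and x3 in three, so both are kept, while x2 is significant only once and is dropped.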

Random Forest Model

RF is an ensemble learning algorithm proposed by Breiman [24], which constructs multiple decision trees using bootstrap sampling and random feature selection. By aggregating the predictions of individual trees through majority voting, RF can effectively improve classification accuracy and reduce overfitting. Owing to its ability to model complex nonlinear relationships and handle high-dimensional data, RF has been widely applied in forest fire susceptibility assessment [25].

Consistent with the sample construction strategy used in the LR model, fire samples were assigned a value of 1 and non-fire samples a value of 0, with a ratio of 1:1. The complete dataset was randomly divided into a training set (70%) and an independent testing set (30%).

To improve predictive performance, a series of candidate RF parameter configurations were systematically evaluated using grid search. Four hyperparameters (n_estimators, max_features, max_depth, and min_samples_leaf) were included in the tuning process. The corresponding search space was defined as follows: n_estimators = [100, 200, 300, 500], max_features = [2, 3, 4, 5], max_depth = [10, 20, 30, None], and min_samples_leaf = [1, 2, 4]. Model performance under different parameter settings was assessed through ten-fold cross-validation, and the configuration associated with the highest mean AUC value was selected for subsequent model development.

According to the validation results, the final model used 500 trees (n_estimators = 500), a maximum tree depth of 20 (max_depth = 20), two predictor variables considered at each split (max_features = 2), and a minimum leaf-node size of one sample (min_samples_leaf = 1).
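To give a sense of the search cost, the sketch below enumerates the hyperparameter grid stated above and counts the model fits implied by ten-fold cross-validation. It only builds the candidate list and does not train any model.

```python
from itertools import product

# Search space as listed in the text.
param_grid = {
    "n_estimators": [100, 200, 300, 500],
    "max_features": [2, 3, 4, 5],
    "max_depth": [10, 20, 30, None],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination a full grid search would evaluate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_candidates = len(candidates)   # 4 * 4 * 4 * 3
n_fits = n_candidates * 10       # one fit per fold under ten-fold cross-validation

# The configuration reported as final is one point of this grid.
final = {"n_estimators": 500, "max_features": 2, "max_depth": 20, "min_samples_leaf": 1}
```

The grid therefore contains 192 configurations and implies 1920 cross-validation fits before the final model is refit.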

Support Vector Machine Model

Support Vector Machine (SVM) is a supervised machine learning algorithm widely used for binary classification problems and has demonstrated strong performance in forest fire susceptibility assessment. The core idea of SVM is to identify an optimal separating hyperplane that maximizes the margin between fire and non-fire samples in the feature space. The classification boundary is determined by a subset of training samples known as support vectors, which are located closest to the hyperplane [26,27].

In this study, fire samples were assigned a value of 1, whereas non-fire samples were assigned a value of 0. Consistent with the LR and RF models, the complete dataset was randomly divided into a training set (70%) and an independent testing set (30%). Prior to model training, all predictor variables were standardized using the StandardScaler method to eliminate the influence of differences in variable magnitude.

To identify the most suitable kernel function for forest fire prediction, four commonly used kernel functions, including the linear kernel, sigmoid kernel, polynomial kernel, and Radial Basis Function (RBF) kernel, were evaluated (Table 2). Comparative experiments indicated that the RBF kernel achieved the best predictive performance and was therefore selected for subsequent modeling.

To improve model robustness and avoid overfitting, a grid-search strategy combined with ten-fold stratified cross-validation was employed on the training dataset. The penalty parameter (C) was searched within the range of (0.1, 1, 10, 100), while the kernel coefficient (γ) was searched within the range of (0.001, 0.01, 0.1, 1). Model performance during parameter optimization was evaluated using the mean Area Under the Receiver Operating Characteristic Curve (AUC). The optimal parameter combination obtained from the cross-validation procedure was subsequently used to construct the final SVM model. To account for potential class imbalance, the class weight parameter was set to “balanced”.
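The two preprocessing and kernel ingredients mentioned above can be written out in a few lines. The sketch below shows z-score standardization (using the population standard deviation) and the RBF kernel K(x, y) = exp(−γ‖x − y‖²), together with the (C, γ) grid from the text. The function names and the sample elevation values are hypothetical.

```python
import math

def standardize(column):
    """Z-score a list of values: subtract the mean, divide by the population std."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical elevations in metres, rescaled to zero mean and unit variance.
elev = standardize([200.0, 600.0, 1000.0])

# Search grid as listed in the text.
C_grid = [0.1, 1, 10, 100]
gamma_grid = [0.001, 0.01, 0.1, 1]
grid = [(C, g) for C in C_grid for g in gamma_grid]

k_same = rbf_kernel([0.0, 1.0], [0.0, 1.0], gamma=0.1)  # identical points
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.1)   # squared distance 25
```

Because the kernel depends on squared distances, unstandardized variables with large ranges (such as elevation in metres) would otherwise dominate the similarity measure, which is why scaling precedes training.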

2.3.3. Interpretation

Many machine learning models act as black boxes: they fit parameters in hidden dimensions and do not provide explicit relationships such as regression coefficients and confidence intervals [28]. To explain model outputs, the SHapley Additive exPlanations (SHAP) method, which is based on cooperative game theory, combines optimal credit allocation with local interpretation by assigning a SHAP value to each feature [29]. These values quantify the magnitude and direction of variable contributions: positive values push the prediction toward fire occurrence, and negative values push it away. The relative importance of a variable is determined by averaging its absolute SHAP values over all cases. Nonlinear relationships are visualized through partial dependence plots, which plot SHAP values against the values of the focal variable while keeping other variables constant. These plots are smoothed using Locally Weighted Scatterplot Smoothing (LOWESS) regression with a fraction of 0.3. SHAP values also provide insight into the local importance of variables in specific cases, which cannot be evaluated from the model’s global variable importance. For each fire case, the primary and secondary contributing variables are identified by the largest and second-largest absolute SHAP values, respectively.

To improve model interpretability and quantify the contribution of each driving factor to forest fire occurrence, SHAP (SHapley Additive exPlanations) analysis was performed on the RF model. Because RF is a tree-based ensemble algorithm, the TreeSHAP method was employed to efficiently estimate feature contributions. SHAP values were calculated using the training dataset and were subsequently used to generate SHAP summary plots and feature importance rankings. The mean absolute SHAP value of each variable was adopted as an indicator of its overall contribution to model predictions.
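The idea behind SHAP values can be made concrete with a toy model small enough for exact computation. The sketch below averages each feature's marginal contribution over all feature orderings; this brute-force approach is for illustration only and is not TreeSHAP, which obtains the same quantities efficiently for tree ensembles. The toy model, its coefficients, and the feature names are hypothetical.

```python
from itertools import permutations

def shapley_values(value_fn, features):
    """Exact Shapley values: average marginal contribution over all orderings."""
    phi = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            phi[f] += (after - before) / len(orderings)
    return phi

def toy_model(coalition):
    """Model output when only the features in `coalition` are known (illustrative)."""
    out = 0.5                      # baseline prediction
    if "a" in coalition:
        out += 0.3
    if "b" in coalition:
        out -= 0.2
    if "a" in coalition and "c" in coalition:
        out += 0.1                 # interaction shared between a and c
    return out

phi = shapley_values(toy_model, ["a", "b", "c"])
# Rank features by absolute contribution, as in a mean |SHAP| importance plot.
ranking = sorted(phi, key=lambda f: abs(phi[f]), reverse=True)
```

The contributions sum to the difference between the full prediction and the baseline, and the sign of each value shows whether the feature raises or lowers the prediction.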

2.3.4. Performance Evaluation Methods

The confusion matrix and the ROC curve are two widely used evaluation methods for measuring the prediction accuracy of classification models [30].

Confusion Matrix Analysis

The confusion matrix is a statistical table of classification results for binary classification models that makes the types of model error explicit [31]. By matching the true labels of samples with the model’s predictions, it yields the four basic outcomes shown in Table 3.

Based on the above four types of results, indicators such as recall rate, precision rate, and overall accuracy can be further calculated, realizing comprehensive measurement of the false negative control, false positive control, and overall effect of the fire prediction model.
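The indicators named above follow directly from the four confusion-matrix counts. The sketch below computes them for a small set of made-up labels (1 = fire, 0 = non-fire); the function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 = fire and 0 = non-fire."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Recall, precision, overall accuracy, and F1-score from the four counts."""
    recall = tp / (tp + fn)        # share of real fires that were detected
    precision = tp / (tp + fp)     # share of predicted fires that were real
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}

# Made-up labels: one missed fire and one false alarm out of eight samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
scores = classification_metrics(*counts)
```

Recall reflects control of false negatives (missed fires), while precision reflects control of false positives (false alarms).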

The Receiver Operating Characteristic Curve

The ROC curve is a classic curve tool for measuring the generalization ability of binary classification models. Its core logic is to dynamically adjust the “probability threshold for the model to judge a fire point”, record the corresponding “True Positive Rate (TPR)” and “False Positive Rate (FPR)” under different thresholds, and fit these points into a continuous curve. The True Positive Rate (TPR) corresponds to the recognition rate of fire point samples, reflecting the model’s “ability to control false negative risks”; the False Positive Rate (FPR) corresponds to the false positive rate of non-fire point samples, reflecting the model’s “ability to control false positive costs” [32,33].

The Area Under the ROC Curve is the core indicator for quantifying the performance of the curve, with a value range of 0 to 1: the closer the AUC is to 1, the stronger the model’s ability to distinguish between fire points and non-fire points. In the class-imbalanced scenario of “excessively high proportion of non-fire point samples” in fire prediction, AUC can effectively avoid the interference of sample distribution bias and more accurately reflect the model’s ability to recognize fire point samples [34].
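A minimal sketch of how one ROC point and the AUC can be computed from predicted scores is given below. It uses the rank interpretation of AUC, namely the probability that a randomly chosen fire sample receives a higher score than a randomly chosen non-fire sample. The labels, scores, and function names are illustrative assumptions.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when samples with score >= threshold are predicted as fire."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

def auc(y_true, scores):
    """AUC as the share of (fire, non-fire) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predictions: one non-fire sample (0.7) outranks one fire sample (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_point(y_true, scores, threshold=0.5)
auc_value = auc(y_true, scores)
```

Sweeping the threshold from 1 down to 0 traces the full ROC curve; the pairwise definition gives the same area without choosing any threshold.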

3. Results

3.1. Analysis of Characteristic Factor Screening Results

In this study, the VIF method was used to screen the forest fire driving factors for the LR model, and factors with a VIF of less than 10 were retained, as presented in Table 4.

Five intermediate models were obtained by fitting based on nine training samples. Among them, NDVI, Tmp, Pres, DTR, DTRA, DEM, and Slope were significant in at least three of these models. To identify the key driving factors of forest fire occurrence, this study established an LR model based on forest-fire-related data in the study area and used standardized regression coefficients to quantify the degree of influence of each driving factor on forest fire occurrence.

As shown in Table 5, the direction and intensity of each driving factor on forest fire occurrence differed significantly, together constituting the comprehensive driving mechanism of forest fire occurrence in the study area. To further clarify the meaning of each column in Table 5: Estimation coefficient indicates the effect of each driving factor on the log-odds of fire occurrence in the LR model. Positive values suggest that higher values of the factor increase the likelihood of fire occurrence, while negative values indicate a suppressing effect. For example, NDVI has a negative coefficient, meaning denser vegetation reduces fire probability. Standard error quantifies the uncertainty of the estimated coefficient. Smaller values indicate more reliable estimates of the factor’s effect on fire occurrence. Chi-square value tests the statistical significance of each factor’s contribution. Larger values imply that the factor has a stronger influence in explaining the variability in fire occurrence across the study area. Standardized regression coefficient allows comparison across variables on a common scale, showing the relative importance of each factor. For example, Tmp and Pres have the largest positive standardized coefficients, indicating they are the core meteorological drivers that enhance fire susceptibility, while NDVI has the largest negative standardized coefficient, highlighting its key inhibitory role. Collectively, these measures provide a quantitative connection between model outputs and the observed phenomena of forest fires, allowing interpretation of how meteorological, vegetation, topographic, and human activity factors jointly regulate fire occurrence in the Greater Khingan region.

3.2. Driving Factor Selection and Importance Analysis Based on RF and SVM

Table 6 shows the driving factors screened by the RFE method, and the screened factors were used for the construction of the RF and SVM models.

Under the RF model, the importance ranking of forest fire driving factors presents clear differences across variables. As shown in Figure 3a, climatic factors dominate the overall feature importance, with Pres ranking first, followed by DEM and Tmp. This is further corroborated by the SHAP summary plot (Figure 3b), where Pres exhibits the strongest positive influence on model output, indicating that higher Pres significantly increases relative fire likelihood. Notably, NDVI ranks third in feature importance, but shows a distinct negative driving effect: higher NDVI values (red points) correspond to negative SHAP values, meaning denser vegetation cover strongly suppresses fire risk by limiting fuel availability and increasing surface moisture. Tmp also shows a clear positive trend, with higher temperatures associated with elevated fire susceptibility, consistent with its high feature importance. Prec ranks among the top factors and exhibits a negative driving effect: higher precipitation correlates with negative SHAP values, indicating that increased rainfall effectively reduces fire risk by raising surface humidity. Other meteorological factors, including RH and DTR, also show notable importance, with their SHAP distributions reflecting complex regulatory roles in fire dynamics. At the topographic level, DEM and Slope show moderate importance. Slope generally acts as a positive driver, with steeper slopes linked to higher fire risk, likely due to accelerated fuel drying and fire spread. DEM displays a two-way regulatory pattern, with its effect shifting from negative to positive across different elevation zones, reflecting its role in modulating temperature, vegetation, and moisture gradients.

Anthropogenic factors (POP and GDP) rank lowest in both feature importance and mean absolute SHAP values, with their SHAP distributions tightly clustered around zero. This confirms that human activities exert negligible influence on fire occurrence in the sparsely populated study area, where fire dynamics are primarily governed by natural environmental processes. Collectively, the results from feature importance and SHAP analysis consistently support a “climate-dominated, topography-regulated, human-weak” driving mechanism for forest fires in the study area.

3.3. Accuracy Evaluation

To further explore the applicability of different models in predicting the probability of forest fire occurrence in the Greater Khingan, the ROC curve was used to evaluate the accuracy of each model. Figure 4, Figure 5 and Figure 6 show the ROC curves and AUC values of the LR model, RF model, and SVM model, respectively. The confusion matrix results show that the overall accuracy of the LR model is 74.2%, that of the SVM model is 86.7%, and that of the RF model reaches 91.9% (Table 7). On this basis, the RF model outperforms both the LR model and the SVM model in forest fire risk prediction.

For the SVM model, the Gaussian RBF kernel was adopted, as it is well suited to capturing complex nonlinear relationships between multi-source environmental drivers and forest fire occurrence. Model performance was assessed using 10-fold stratified cross-validation on the training set, with key evaluation metrics including accuracy, AUC, recall, and F1-score. The results showed that the SVM model with the RBF kernel achieved strong predictive performance, with an accuracy of 0.867, an AUC of 0.929, a recall of 0.842, and an F1-score of 0.854. These findings confirm that the Gaussian radial basis function kernel effectively balances model fitting and generalization.
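The stratified cross-validation mentioned above can be sketched as follows. This is a simplified illustration under assumed inputs (a hypothetical list of 50 fire and 50 non-fire labels and the helper name `stratified_kfold`), not the procedure actually run in the study.

```python
import random


def stratified_kfold(labels, k=10, seed=0):
    """Return k lists of sample indices, each with roughly the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Hypothetical balanced labels: 50 fire (1) and 50 non-fire (0) samples.
labels = [1] * 50 + [0] * 50
folds = stratified_kfold(labels, k=10)
for held_out in folds:
    train = [i for f in folds if f is not held_out for i in f]
    # A model would be fitted on `train` and scored on `held_out` here.
```

Each fold serves once as the held-out set while the remaining nine are used for fitting, and the reported metric is typically the average over the ten held-out scores.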

3.4. Thematic Map of Forest Fire Risk Prediction

Given the excellent performance of the RF model regarding fire prediction, this study conducted full-sample fitting of the fire risk in the Greater Khingan region (Mohe City, Tahe County, Huma County) based on 12 key driving factors screened by the RFE method (Figure 7). Regions with ultra-high wildfire susceptibility (deep crimson) are predominantly distributed across the eastern and northeastern parts of Tahe County, forming continuous, clustered, high-hazard patches. Such spatial patterns suggest that severe wildfire incidents are highly likely to occur when favourable meteorological conditions emerge. By comparison, zones with minimal fire hazard (dark green) are primarily located in central and western Mohe, as well as in the majority of Huma County. These areas exhibit low wildfire occurrence probability, indicating robust ecological stability and a low likelihood of fire initiation. The high fire risk areas (light red) are scattered around the extremely high risk areas, mainly in the eastern edge of Tahe County, forming a transition zone between high-risk and medium-risk areas; the medium-risk areas (yellow) are interspersed between high and low risk areas in the form of fragmented patches, mainly distributed along the border between Tahe County and Huma County, representing a medium level of fire hazard; the low fire risk areas (light green) widely cover Mohe City and Huma County, forming a buffer zone between extremely low-risk and medium risk areas, with only occasional fire possibilities. To facilitate interpretation of the susceptibility map, historical fire points from 2001–2018 were overlaid on Figure 7 as deep red dots. The spatial distribution of the fire points shows a clear concentration in the northern part of the study area, while the predicted susceptibility pattern exhibits an increasing trend toward the north and northeast.
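The step from predicted probabilities to the five mapped risk levels can be sketched as below. The break values are hypothetical equal intervals chosen only for illustration; the thresholds actually used for Figure 7 are not stated in this section, and the small grid stands in for raster cells.

```python
from bisect import bisect_right

# Hypothetical equal-interval breaks; the thresholds behind Figure 7 are not given here.
BREAKS = [0.2, 0.4, 0.6, 0.8]
LEVELS = ["extremely low", "low", "medium", "high", "extremely high"]


def risk_level(probability):
    """Map a predicted fire probability in [0, 1] to one of five risk levels."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return LEVELS[bisect_right(BREAKS, probability)]


# Toy 2 x 3 grid of predicted probabilities standing in for raster cells.
grid = [[0.05, 0.35, 0.55], [0.65, 0.95, 0.20]]
classified = [[risk_level(p) for p in row] for row in grid]
```

A data-driven scheme such as natural breaks would replace `BREAKS` with values derived from the distribution of predicted probabilities.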

4. Discussion

4.1. Correlation Characteristics Between Different Driving Factors and Forest Fires

This study confirms that forest fire occurrence in the Greater Khingan follows a core driving mechanism of climate dominance, topographic regulation, and weak human influence. The relationships between the factors in each dimension and forest fires agree with the cold-temperate physical geographical background of the study area, and the driving pattern is consistent with findings from remote sensing-based studies on forest fire mechanisms in cold-temperate forest zones [35,36]. Among climatic factors, Pres and Prec are core drivers. Increased Pres corresponds to the control of regional dry weather systems, which significantly raises fuel dryness and acts as an important trigger for forest fires. Precipitation, by contrast, exerts a negative driving effect by increasing surface and fuel moisture, inhibiting fire ignition and spread, a pattern consistent with studies on meteorological drivers of forest fires in northern forest regions using multi-source remote sensing data [37].

Tmp and NDVI show opposite driving effects. Tmp presents a positive effect: rising temperatures accelerate the dehydration of dead leaves and other combustibles and thus promote fire occurrence. By contrast, NDVI exerts a negative driving effect. Higher NDVI values correspond to denser forest canopy and greater live vegetation cover, which can increase surface moisture and limit fuel availability, thereby reducing fire risk. This result is consistent with the dominance of Larix gmelinii forests in the study area, where dense stands tend to suppress fire ignition and spread. Remotely sensed NDVI has been widely adopted to characterize large-scale spatial patterns of vegetation cover and fuel abundance, and remains a robust and practical indicator for regional fire risk assessment [38]. However, NDVI mainly reflects live vegetation greenness and canopy conditions and cannot directly represent dead fuel accumulation, litter depth, or fuel moisture content. These fuel properties are known to play important roles in fire ignition and propagation in boreal coniferous forests. Therefore, NDVI should be regarded as a regional-scale proxy of vegetation conditions rather than a complete description of fuel characteristics.

Among topographic factors, Slope shows a positive driving effect: steep slopes not only accelerate fuel drying but also promote upslope fire spread. Elevation exhibits bidirectional regulation, exerting differentiated impacts on fire risk across elevation zones by altering vertical variations in Tmp and Prec. Integrated remote sensing and GIS technologies can accurately characterize such spatially heterogeneous regulatory effects of terrain on forest fires. Among human factors, POP, GDP, DTRA and DTR show weak driving effects, with SHAP values close to zero. This is consistent with the sparsely populated study area, where human activities are concentrated locally, and indicates that natural factors dominate the formation of regional forest fire patterns.

4.2. Comparative Analysis and Applicability of Different Models

Significant performance differences were observed among the LR, RF and SVM models in forest fire prediction for the Greater Khingan, and their applicability was closely associated with regional data attributes and the intricate ignition mechanisms underlying wildfire outbreaks. As a traditional binary classification algorithm, the LR model is simple to compute and highly interpretable, yet it yielded the lowest prediction accuracy. Because it captures only linear relationships between the driving factors and the log-odds of fire occurrence, it struggles to fit the complex nonlinear relationships involved and is best suited to preliminary qualitative assessment of forest fire risk. The SVM model maps low-dimensional features to a high-dimensional space via kernel functions, which addresses the nonlinear relationships among factors and yields clearly better prediction performance than the LR model (accuracy 0.867 versus 0.742). However, this model is sensitive to parameter settings, as the choice of kernel function and regularization parameter directly affects prediction outcomes; in addition, its computational efficiency decreases on large samples, making it best suited to forest fire prediction scenarios of moderate scale and moderate data dimensionality [39].

The RF model exhibited the best prediction performance: its ensemble voting mechanism reduces the risk of overfitting, and it adapts well to multi-source heterogeneous data. Moreover, it quantifies the importance of each factor, achieving both high prediction accuracy and good interpretability. In this study, the hyperparameters of the RF model were optimized via grid search with 10-fold cross-validation, and the final SHAP interpretation was based on the single optimal retrained model rather than averaged across cross-validation folds. Previous studies have also found the RF model to be among the best-performing algorithms for processing high-dimensional remote sensing forest fire data and achieving refined prediction [40]. The model is also reasonably robust to outliers and missing values [41], which suits the complex driving factors and high data dimensionality of the Greater Khingan and makes it the most suitable of the three algorithms for refined forest fire prediction in this region.

It should be noted that the objective of this study was not operational fire forecasting but long-term forest fire susceptibility assessment. Therefore, annual average environmental variables from 2001–2018 were adopted to characterize the stable spatial distribution of fire-prone conditions. Under this framework, the models identify areas with persistently elevated fire potential rather than predicting the exact timing of fire occurrence. Consequently, the resulting susceptibility map is more suitable for long-term forest management, prevention planning, and resource allocation than for short-term fire early warning applications.

The modelling results should also be interpreted in light of the time span of the explanatory variables. Because the model captures stable relationships between fire occurrence and environmental drivers over an 18-year period, the identified high-risk areas remain informative for regional-scale forest management and strategic fire prevention planning, but they characterize long-term spatial patterns of susceptibility rather than current-year fire risk. Climate change, vegetation dynamics, and changes in human activities after 2018 may alter local fire susceptibility patterns. Therefore, periodic model updating using newly available datasets is recommended to maintain the long-term applicability of the susceptibility maps.

In addition, all three models demonstrated stable classification performance after balancing fire and non-fire samples. The balanced sampling strategy was adopted to reduce model bias toward the majority class and to facilitate comparisons among different algorithms. However, real-world landscapes are typically characterized by a much larger proportion of non-fire pixels than fire pixels. Therefore, the reported performance metrics should be interpreted within the context of the balanced dataset used in this study, and further validation under naturally imbalanced conditions is recommended in future research.
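The effect of class imbalance on the reported metrics can be illustrated with a short calculation. The sketch below takes the RF recall and specificity from Table 7 and applies Bayes' rule to estimate precision at different shares of fire samples. It assumes that recall and specificity would carry over unchanged to an imbalanced landscape, which is an assumption and not something tested in this study, and the lower prevalence values are hypothetical.

```python
def precision_at_prevalence(recall, specificity, prevalence):
    """Expected precision when the share of fire samples is `prevalence`."""
    true_pos = recall * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


RF_RECALL, RF_SPECIFICITY = 0.945, 0.893  # RF row of Table 7

# 0.5 approximates the balanced test set; the smaller shares are hypothetical.
for prevalence in (0.5, 0.1, 0.01):
    p = precision_at_prevalence(RF_RECALL, RF_SPECIFICITY, prevalence)
    print(f"fire share {prevalence:>5}: precision = {p:.3f}")
```

Under this assumption, precision falls well below its balanced-set value as fire samples become rarer, which is why validation on naturally imbalanced data matters for any operational use of the map.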

4.3. Limitations and Future Perspectives

Although the proposed framework achieved satisfactory predictive performance and provided useful insights into the spatial distribution of forest fire susceptibility in the Greater Khingan, several aspects warrant further improvement.

First, the present study focused on long-term fire susceptibility assessment using annual average environmental variables. While this design effectively captures stable spatial patterns of fire-prone conditions, it cannot represent short-term variations associated with seasonal weather extremes and therefore is not intended for operational fire forecasting.

Second, the current modeling framework treats each grid cell as an independent observation. In reality, wildfire occurrence and spread exhibit strong spatial dependence, and fire events in neighboring locations may influence subsequent fire dynamics. Future studies could integrate spatial autocorrelation metrics, neighborhood effects, or fire spread simulation approaches to better represent wildfire propagation processes.
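One common spatial autocorrelation metric is global Moran's I. The following minimal sketch computes it on a small grid with rook (four-neighbour) binary weights; the grids are hypothetical toy data, and the choice of neighbourhood is an assumption made for illustration rather than a recommendation from this study.

```python
def morans_i(grid):
    """Global Moran's I on a 2-D grid with rook (4-neighbour) binary weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant grid")
    num = 0.0
    weight_sum = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)


# Hypothetical fire (1) / non-fire (0) grids: one clustered, one alternating.
clustered = [[1, 1, 0, 0] for _ in range(4)]
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]
i_clustered = morans_i(clustered)
i_checker = morans_i(checkerboard)
```

Positive values indicate that similar cells tend to be neighbours (clustering), while negative values indicate alternation; values near zero suggest no spatial structure.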

Third, although the balanced sampling strategy improved model training and comparison among algorithms, real-world forest landscapes are inherently dominated by non-fire pixels. Additional validation under naturally imbalanced conditions would further improve the operational applicability of the model.

Future research should also explore the integration of higher-temporal-resolution climate data, refined fuel descriptors, and spatially explicit modeling techniques. Such developments may contribute to the establishment of a more comprehensive forest fire risk assessment framework for cold-temperate forest ecosystems.

5. Conclusions

This study takes the Greater Khingan region as the research area. Using long-term time-series data from 2001 to 2018, a forest fire prediction indicator system was constructed by integrating 13 key factors across four dimensions: terrain, fuel, meteorology, and human activities. The fire prediction capabilities of the LR, RF and SVM algorithms were compared. The driving mechanism of forest fires was interpreted using the SHAP method, and the RF model, which achieved the highest accuracy, was then used to map regional wildfire susceptibility. The primary findings are summarized below.

The driving mechanism of forest fire occurrence in the Greater Khingan region is characterized by climate dominance, topographic regulation, and weak human influence. Among climatic factors, Pres and Prec are the core driving forces, with Pres and Tmp showing positive driving effects and Prec a negative driving effect. Among topographic factors, slope promotes the occurrence of forest fires, and elevation has a bidirectional regulatory effect. Human factors such as POP and GDP exert negligible driving effects on forest fires because the study area is sparsely populated.

The three machine learning models show significant differences in predictive performance, among which the RF model performs the best, with an accuracy of 0.919 and an AUC of 0.966, far superior to LR and SVM. This model combines high prediction accuracy, strong data adaptability, and clear factor interpretability, making it the most suitable algorithm for forest fire prediction in the Greater Khingan region.

Forest fire risk in the Greater Khingan region displays significant spatial heterogeneity. Zones with extreme wildfire vulnerability predominantly cluster in eastern and northeastern Tahe, forming continuous high-hazard patches. By contrast, areas with negligible fire susceptibility are located in central-western Mohe and most of Huma. High, medium, and low risk zones are interspersed in the form of transition zones and fragmented patches. This spatial pattern is highly consistent with the spatial differentiation of climatic and topographic factors. The proposed framework characterizes the long-term spatial distribution of forest fire susceptibility in the Greater Khingan region and identifies the dominant environmental drivers associated with forest fire occurrence. It provides technical support and data references for the refined prevention and control of forest fires and the science-based deployment of firefighting resources in cold-temperate forest zones, and also offers a methodological reference for long-term forest fire risk assessment in other forest regions of China.

Author Contributions

H.L.: Composition—initial version, Visualization, Software, Methodology, Investigation, Conceptualization; J.Z.: Composition—review & refinement, Supervision, Project oversight, Conceptualization; J.Y.: Verification, Data provision, Project oversight; C.T.: Composition—review & refinement, Statistical analysis; K.L.: Visualization, Supervision; K.S.: Software. All authors have read and agreed to the published version of the manuscript.

Funding

This research received support from the National Natural Science Foundation of China (grant numbers 42207507, 32260390), Open Grant for Key Laboratory of Sustainable Forest Ecosystem Management-Ministry of Education, School of Forestry, Northeast Forestry University (grant number KFJJ2023YB02), the Yunnan Fundamental Research Projects (grant numbers 202501AS070047, 202501AU070062), the Yunnan Provincial Department of Education Science Research Fund Project (grant number 2024J0665), and Forestry Innovation Programs of Southwest Forestry University (grant number LXXK-2023Z06).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

We recognize the valuable input and feedback from the editors and anonymous reviewers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yuan, S.; Wang, M.; Shu, L.; Ma, Q.; Song, J.; Xiao, F.; Zhou, X.; Wang, J. Characteristics of Lightning Ignition and Spatial–Temporal Distributions Linked with Wildfires in the Greater Khingan Mountains. Fire 2025, 8, 474.
2. Wang, Y.; Liu, X.; Jin, H.; Zeng, X.; Zhang, X.; Kang, H.; Kang, S.; Li, Y.; Zhang, Q. Climate-induced permafrost degradation exerts species-specific impacts on pine and larch growth in the Da Xing’anling Mountains, Northeast China. Agric. For. Meteorol. 2025, 372, 110665.
3. Zhang, H.; Liu, X.; Ban, Q.; Shu, Y. Data source effects on forest fire prediction models in Northern China under multiple climate scenarios. Res. Sq. 2026, preprint (Version 1).
4. Liu, Y.; Ding, A. Contrasting trends of carbon emission from savanna and boreal forest fires during 1999–2022. Meteorol. Appl. 2024, 31, e2177.
5. Gao, C.; Shi, C.; Li, J.; Yuan, S.; Huang, X.; Zhang, Q.; Ma, Q.; Wu, G. Igniting lightning, wildfire occurrence, and precipitation in the boreal forest of northeast China. Agric. For. Meteorol. 2024, 354, 110081.
6. San-Miguel-Ayanz, J.; Moreno, J.M.; Camia, A. Analysis of large fires in European Mediterranean landscapes: Lessons learned and perspectives. For. Ecol. Manag. 2013, 294, 11–22.
7. Pausas, J.G.; Keeley, J.E. Wildfires and global change. Front. Ecol. Environ. 2021, 19, 387–395.
8. Rafaqat, W.; Iqbal, M.; Kanwal, R.; Song, W. Study of Driving Factors Using Machine Learning to Determine the Effect of Topography, Climate, and Fuel on Wildfire in Pakistan. Remote Sens. 2022, 14, 1918.
9. Guria, R.; Mishra, M.; Mohanta, S.; Paul, S. Forest fire probability zonation using dNBR and machine learning models: A case study at the Similipal Biosphere Reserve (SBR), Odisha, India. Environ. Sci. Pollut. Res. 2025, 32, 31375–31396.
10. Su, Z.; Hu, H.; Tigabu, M.; Wang, G.; Zeng, A.; Guo, F. Geographically weighted negative Binomial regression model predicts wildfire occurrence in the Great Xing’an Mountains better than negative Binomial Model. Forests 2019, 10, 377.
11. Partheepan, S.; Sanati, F.; Hassan, J. Modelling bushfire severity and predicting future trends in Australia using remote sensing and machine learning. Environ. Model. Softw. 2025, 188, 106377.
12. Rodrigues, M.; De la Riva, J. An insight into machine-learning algorithms to model human-caused wildfire occurrence. Environ. Model. Softw. 2014, 57, 192–201.
13. Jain, P.; Coogan, S.C.; Subramanian, S.G.; Crowley, M.; Taylor, S.; Flannigan, M.D. A review of machine learning applications in wildfire science and management. Environ. Rev. 2020, 28, 478–505.
14. Abohaia, Z.; Elkhouly, A.; Barachi, M.E.; Al-Khatib, O. Regional Prediction of Fire Characteristics Using Machine Learning in Australia. Fire 2025, 8, 330.
15. Wang, C.; Liu, H.; Xu, Y.; Zhang, F. A forest fire prediction framework based on multiple machine learning models. Forests 2025, 16, 329.
16. Sheriff, R.; Meer, M.S.; Aslam, R.W.; Said, Y. Machine learning-based forest fire susceptibility mapping using random forest and CART models. Rangel. Ecol. Manag. 2025, 102, 96–109.
17. Tian, X.; Shu, L.; Wang, M.; Zhao, F.; Chen, L. The fire danger and fire regime for the Daxing’anling region for 1987–2010. Procedia Eng. 2013, 62, 1023–1031.
18. Song, K.; Wang, Z.; Zhang, D.; Yu, S.; Zhang, T.; Wang, X.; Li, X.; Du, B. Elevational Differentiation in Earlywood and Latewood Density Responses of Pinus sylvestris var. mongolica to Climate in the Northern Greater Khingan Range. Forests 2026, 17, 99.
19. Rew, J.; Cho, Y.; Hwang, E. A Robust Prediction Model for Species Distribution Using Bagging Ensembles with Deep Neural Networks. Remote Sens. 2021, 13, 1495.
20. Gui, R.; Song, W.; Lv, J.; Lu, Y.; Liu, H.; Feng, T.; Linghu, S. Digital elevation model-driven river channel boundary monitoring using the natural breaks (Jenks) method. Remote Sens. 2025, 17, 1092.
21. Xuejiao, C.; Kaitong, Z.; Jiao, W.; Munan, W.; Ying, Q. Analysis of surface albedo responses to forest fires in the Great Xing’an Range, China. Remote Sens. Nat. Resour. 2025, 37, 212–219.
22. Díaz-Vázquez, D.; Casillas-García, L.F.; Garcia-Gonzalez, A.; Montero, S.H.G.; Rubio, J.I.M.; Llamas, J.J.L.; Hernandez, M.S.G. Integrating Remote Sensing and machine learning for dynamic burn probability mapping in data-limited contexts. Remote Sens. Appl. Soc. Environ. 2025, 38, 101554.
23. Durlević, U.; Ilić, V.; Valjarević, A. Wildfire susceptibility mapping using deep learning and machine learning models based on multi-sensor satellite data fusion: A case study of Serbia. Fire 2025, 8, 407.
24. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
25. Duan, J.; Hu, J.; Fu, Y.; Liu, Q.; Li, R.; Wang, Y. Estimation of fire counts and fire radiative power using satellite optical and microwave vegetation indices with random forest method. J. Geophys. Res. Atmos. 2025, 130, e2024JD041680.
26. Gigović, L.; Pourghasemi, H.R.; Drobnjak, S.; Bai, S. Testing a new ensemble model based on SVM and random forest in forest fire susceptibility assessment and its mapping in Serbia’s Tara National Park. Forests 2019, 10, 408.
27. Jalili, A.; Saleki, Z.; Luo, Y.; Pan, F.; Chen, A.X.; Draayer, J.P. Performance of various kernel functions for mass prediction with support vector machine. Eur. Phys. J. A 2025, 61, 143.
28. Singh, H.; Ang, L.-M.; Paudyal, D.; Acuna, M.; Srivastava, P.K.; Srivastava, S.K. A Comprehensive Review of Empirical and Dynamic Wildfire Simulators and Machine Learning Techniques used for the Prediction of Wildfire in Australia. Technol. Knowl. Learn. 2025, 30, 935–968.
29. Dai, S.; Zhang, J.; Huang, Z.; Zeng, S. Fire prediction and risk identification with interpretable machine learning. J. Forecast. 2025, 44, 1699–1715.
30. Bouzeraa, Y.; Bouchemal, N.; Djaaboub, S.; Hristov, G.; Zahariev, P. Machine Learning-Based Wildfire Susceptibility Mapping: A GIS-Integrated Predictive Framework. Appl. Sci. 2025, 15, 12188.
31. Fan, C.-L. Evaluation model for crack detection with deep learning: Improved confusion matrix based on linear features. J. Constr. Eng. Manag. 2025, 151, 04024210.
32. Chen, R.; Zhang, Y.; Li, Y.; Yebra, M.; Fan, C.; Zhang, H.; He, B. Probabilistic mapping of high-intensity forest fire potential via time series machine learning and remote sensing-informed fire spread simulations. Remote Sens. Environ. 2026, 334, 115233.
33. Ahmad, H.; Wu, Z.; Huang, H.; Muhammad, S.; Hayat, M.; Abbas, K.; Yang, X.; Shu, Z. A comparative evaluation of forest fire hazard vulnerability through geographic information system-based techniques. Front. For. Glob. Change 2025, 8, 1635041.
34. Smithson, M. The receiver operating characteristic area under the curve (or mean ridit) as an effect size. Psychol. Methods 2025, 30, 678.
35. Bhattarai, H.; Val Martin, M.; Sitch, S.; Yung, D.H.; Tai, A.P. Global patterns and drivers of climate-driven fires in a warming world. EGUsphere 2025, 2025, 1–28.
36. Gaboriau, D.M.; Remy, C.C.; Girardin, M.P.; Asselin, H.; Hély, C.; Bergeron, Y.; Ali, A.A. Temperature and fuel availability control fire size/severity in the boreal forest of central Northwest Territories, Canada. Quat. Sci. Rev. 2020, 250, 106697.
37. Lou, L.; Ma, W.; Cheng, P.; Liu, H.; Huang, Y. Climatic and Fuel Drivers of Lightning-Induced Forest Fire Burned Area in the Da Hinggan Ling Region, Northeast China. Remote Sens. 2026, 18, 657.
38. Pereira-Pires, J.E.; Aubard, V.; Ribeiro, R.A.; Fonseca, J.M.; Silva, J.M.; Mora, A. Fuel break vegetation monitoring with Sentinel-2 NDVI robust to phenology and environmental conditions. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS; IEEE: Piscataway, NJ, USA, 2021; pp. 6264–6267.
39. Anguita, D.; Ghio, A.; Greco, N.; Oneto, L.; Ridella, S. Model selection for support vector machines: Advantages and disadvantages of the machine learning theory. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN); IEEE: New York, NY, USA, 2010; pp. 1–8.
40. Ramo, R.; Chuvieco, E. Developing a random forest algorithm for MODIS global burned area classification. Remote Sens. 2017, 9, 1193.
41. Pearson, R.K. Outliers in process modeling and identification. IEEE Trans. Control Syst. Technol. 2002, 10, 55–63.

Figure 1. Spatial Location of the Greater Khingan in Northeast China; Temporal and Spatial Patterns of Wildfire Locations Across the Greater Khingan Region During 2001–2018.

Figure 2. Technology road map for this study. Different colors are used only to distinguish workflow components and do not represent specific quantitative information.

Figure 3. The contribution and SHAP value of the variables in the RF model. All results presented in this figure are derived from the training dataset (n = 6757). (a) Comparison of RF feature importance and mean absolute SHAP values for the selected forest fire driving factors; (b) SHAP summary diagram reflecting the influence magnitude and positive–negative effects of individual drivers on RF fire prediction outputs, where color indicates the magnitude of feature values and the horizontal axis represents the SHAP value.

Figure 4. ROC curve of LR model.

Figure 5. ROC curve of RF model.

Figure 6. ROC curve of SVM model.

Figure 7. Distribution Map of Fire Risk Prediction in the Greater Khingan.

Table 1. Predictor variables adopted for wildfire susceptibility modeling.

Factor Category	Indicator Title (Abbreviation)	Resolution	Unit	Temporal Coverage	Data Source
Independent Variables	Elevation (DEM)	30 m	m	2000	https://www.nasa.gov, accessed on 1 September 2025
	Slope	30 m	°	2000	https://www.nasa.gov, accessed on 1 September 2025
	Aspect	30 m		2000	https://www.nasa.gov, accessed on 1 September 2025
	NDVI	1 km		2001–2018	https://www.earthdata.nasa.gov, accessed on 1 September 2025
	Precipitation (Prec)	0.01° (~1 km)	mm	2001–2018	http://www.ncdc.ac.cn, accessed on 3 September 2025
	Temperature (Tmp)	0.01° (~1 km)	°C	2001–2018	http://www.ncdc.ac.cn, accessed on 3 September 2025
	Pressure (Pres)	0.01° (~1 km)	hPa	2001–2018	http://www.ncdc.ac.cn, accessed on 3 September 2025
	Wind	0.01° (~1 km)	m/s	2001–2018	http://www.ncdc.ac.cn, accessed on 3 September 2025
	Relative Humidity (RH)	0.01° (~1 km)	%	2001–2018	http://www.ncdc.ac.cn, accessed on 3 September 2025
	Distance to Roads (DTR)		km	2001–2018	https://www.resdc.cn, accessed on 2 September 2025
	Distance to Residential Areas (DTRA)		km	2001–2018	https://www.resdc.cn, accessed on 2 September 2025
	Population (POP)	1 km	persons/km²	2001–2018	https://www.resdc.cn, accessed on 2 September 2025
	GDP	1 km	10⁴ yuan/km²	2001–2018	https://www.resdc.cn, accessed on 2 September 2025
Dependent Variable	Fire point data	1 km	Binary (0 = No fire, 1 = Fire)	2001–2018	https://www.nasa.gov, accessed on 1 September 2025

Table 2. Common kernel functions.

Function Name	Formula	Meaning
Linear kernel function	$K(x, y) = x^{T} y$	No kernel parameters
Sigmoid kernel function	$K(x, y) = \tanh(\gamma x^{T} y + r)$	γ is the coefficient, r is the constant term
Polynomial kernel function	$K(x, y) = (\gamma x^{T} y + r)^{d}$	γ is the coefficient, d is the degree, r is the constant term
RBF kernel function	$K(x, y) = e^{-\gamma \|x - y\|^{2}}, \ \gamma > 0$	γ is the coefficient
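For readers who prefer code to notation, the four kernels can be written in their standard forms as in the minimal sketch below. The default parameter values (`gamma=1.0`, `r=0.0`, `d=3`) are illustrative assumptions and are not the settings tuned in this study.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(x, y) + r) ** d


def sigmoid_kernel(x, y, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, y) + r)


def rbf_kernel(x, y, gamma=1.0):
    # Squared Euclidean distance sits inside the exponent.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

The RBF kernel equals 1 for identical inputs and decays toward 0 as the points move apart, with γ controlling how quickly the similarity falls off.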

Table 3. Confusion Matrix.

Type	Explanation
True Positive (TP)	Fire point samples correctly classified as fire
True Negative (TN)	Non-fire point samples correctly classified as non-fire
False Positive (FP)	Non-fire point samples misclassified as fire
False Negative (FN)	Fire point samples misclassified as non-fire

Table 4. Drivers for LR.

	VIF < 10
Feature factor	NDVI, GDP, Tmp, RH, Pres, DTR, DTRA, DEM, Slope

Table 5. Standardized coefficients of the LR model.

Variable Name	Estimation Coefficient	Standard Error	Chi-Square Value	Standardized Regression Coefficient
Intercept	0.766	0.035	26.808	−0.086
NDVI	−0.489	0.374	45.235	−1.464
Tmp	0.220	0.007	18.749	1.638
Pres	0.062	0.003	42.694	1.646
DTR	1.706	0.094	28.469	0.611
DTRA	−3.522	0.577	37.194	−0.193
DEM	0.005	0.0004	73.431	1.174
Slope	−0.341	0.028	43.136	−0.489

Table 6. Drivers for the RF classification model and the SVM model.

	Contribution Screening Results
Characteristic factor	NDVI, GDP, POP, WIND, Tmp, RH, Pres, Prec, DTR, DTRA, DEM, Slope

Table 7. Confusion matrix and performance metrics on the test set (n = 2761).

Method	Confusion Matrix (TN/FP/FN/TP)	Accuracy	Precision	Recall	Specificity	F1-Score	AUC
Logistic	943/429/283/1106	0.742	0.720	0.797	0.687	0.756	0.798
RF	1225/147/77/1312	0.919	0.899	0.945	0.893	0.921	0.966
SVM	1101/271/97/1292	0.867	0.827	0.930	0.802	0.875	0.929
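The metrics in Table 7 follow directly from the confusion matrix counts. The sketch below recomputes them from the TN/FP/FN/TP values in the table using the standard definitions; the function and dictionary names are chosen here for illustration.

```python
def classification_metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts copied from Table 7 in the order TN, FP, FN, TP.
TABLE_7 = {
    "Logistic": (943, 429, 283, 1106),
    "RF": (1225, 147, 77, 1312),
    "SVM": (1101, 271, 97, 1292),
}

for name, counts in TABLE_7.items():
    m = classification_metrics(*counts)
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Recomputing from the counts reproduces the tabulated values to within 0.001. AUC is not included because it depends on the ranked predicted probabilities rather than on a single confusion matrix.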

