Modeling Anthropogenic Fire Occurrence in the Boreal Forest of China Using Logistic Regression and Random Forests

Frequent and intense anthropogenic fires present meaningful challenges to forest management in the boreal forest of China. Understanding the underlying drivers of human-caused fire occurrence is crucial for making effective and scientifically based forest fire management plans. In this study, we applied logistic regression (LR) and Random Forests (RF) to identify important biophysical and anthropogenic factors that help to explain the likelihood of anthropogenic fires in the Chinese boreal forest. Results showed that anthropogenic fires were more likely to occur in areas close to railways and were significantly influenced by forest type. In addition, distance to settlement and distance to road were identified as important predictors of anthropogenic fire occurrence. The model comparison indicated that RF had greater ability than LR to predict forest fires caused by human activity in the Chinese boreal forest. High fire risk zones in the study area were identified based on RF, and we recommend increasing the allocation of fire management resources in these zones.


Introduction
Boreal forests (45°-70° north latitude) account for more than 25% of the world's forested areas [1] and provide vital natural and economic resources for northern circumpolar countries. In addition, they contain large belowground carbon pools in the form of peat [2,3]. Forest fire has been a major disturbance, influencing the energy flows and biogeochemical cycles in boreal forest ecosystems [4-6].
Human-caused or anthropogenic fires can be linked to a variety of human activities, such as recreation (e.g., camping, hiking, and hunting) and industry (e.g., timber production or railway transportation). It is well known that these activities play a critical role in fire occurrence in the boreal forest. In Ontario, Canada, on average, two-thirds of all forest fires were caused by humans over the 1976 to 1999 period [7], while in the Siberian boreal forest, human-caused fires are responsible for more than 85% of all forest fires [8,9]. Anthropogenic factors also dominate the fire regimes in the Chinese boreal forest [10]. Understanding the primary factors that influence human-caused fire occurrence is crucial for the allocation of fire prevention and suppression resources and for forest management.
Forests 2016, 7, 250
Human activity and biophysical factors have been found to strongly influence anthropogenic fires. Korovin [11] found that most anthropogenic fires started close to roads. Thus, fire proximity to roads has been used as a variable in some fire prediction models [12]. Niklasson and Granström [13] and Wallenius et al. [14] indicated that expansion of human settlements and increased population density drove fire occurrence in the boreal forest of northern Europe. Zumbrunnen et al. and Turco et al. [15,16] revealed the importance of weather factors such as temperature and precipitation for fire occurrence. Other factors such as topography (e.g., elevation and slope) and forest type were found to be meaningful drivers [9,17-20]. Socio-economic indicators, such as unemployment rate and population density, have also been linked to human-caused fire occurrence in many areas [21,22]. In the past decade, researchers have attempted to determine the driving factors and the probability of occurrence of human-caused fire in the Chinese boreal forest [23-25]. However, many quantitative analyses of forest fire drivers have paid less attention to socio-economic or human variables than to climate-related and topographical variables. In this study we attempt to capture some of this complexity by considering the relationships among biophysical and human factors for human-caused fire occurrence.
Logistic regression (LR) is an approach commonly used to model the influence of different factors on fire occurrence (a binary response variable), and has been used in many studies [19,23,26,27]. However, logistic regression has its limitations. For example, it models the log odds of fire occurrence as a linear function of a set of independent or predictor variables. The result of LR modeling may also be affected by flaws in the data such as outliers, multicollinearity among independent variables, and correlated observations. Because nonlinear and complex relationships often exist between biophysical and social variables, logistic regression may not always be sufficient and efficient [28].
Random Forests (RF) is an ensemble learning method based on classification and regression trees (CART). RF can select important variables and calculate the relative importance of each independent variable automatically, no matter how many variables are used initially [29]. Additionally, RF has been demonstrated to have high prediction accuracy and high tolerance to outliers and "noise" [30,31]. Due to these strengths of Random Forests, fire occurrence studies have begun using this approach in recent years [25,32].
In this study, we use both approaches (LR and RF) to evaluate the potential contribution of biophysical and anthropogenic factors to human-caused fire in the Chinese boreal forest. Each approach was applied to the fire data, and the results of the two models were evaluated and compared. Furthermore, maps of fire occurrence likelihood were created based on the results of the two approaches.

Study Site
China's boreal forest, located in the Daxing'an Mountains of northeastern China (50°10′-53°33′ N and 121°12′-127°00′ E), is the southernmost part of the global boreal forest biome. Its total area covers 8.46 × 10^6 ha (Figure 1). The dominant species is Dahurian larch (Larix gmelinii Rupr.), normally accompanied by white birch (Betula platyphylla Suk.) and Mongolian pine (Pinus sylvestris L. var. mongolica Litv.). The Daxing'an Mountains are located in the cold-temperate zone, with a mean annual temperature between −2 °C and 4 °C and a range extending from −52.3 °C to 39.0 °C. The mean total annual precipitation is 350-500 mm.
Boreal forest in this region was largely uninhabited until the construction of the first railway across the mountains in the early 20th century [25]. Before then, fire ignitions were assumed to have been caused primarily by lightning strikes [25,33]. After the introduction of the Reform and Open Policy by the Chinese Government in 1978, China moved into a period of rapid development, leading to more frequent and intensified economic activities in the boreal forest region. Today, this region has the largest average annual burned area in China and is generally exposed to extremely high fire risk due to the increase in forest-based economic activities. Between 1980 and 2005, there were more than 1000 forest fires, including more than 600 human-caused fires, and the total area of burned forest amounted to 1,300,000 ha [33]. In order to reduce forest fire incidence and the costs of damage incurred by forest fires, a series of fire prevention and suppression policies have been issued since 1949 (the foundation of the People's Republic of China), and these were revised after 1987, when the most serious forest fire of the century occurred in the Chinese boreal forest. In recent years, fires have become smaller in burned area, but occur more frequently and intensely than before [34].

Data Collection and Processing
Anthropogenic fire data for the Daxing'an Mountains from 1980 to 2005 were provided by the Forest Fire Prevention Office of Heilongjiang Forestry Bureau, P.R. China, and contained information on fire location, size, cause, and date of occurrence. In this study, anthropogenic causes of forest fires included smoking, hunting, fireworks, and escaped fire from locomotives and residents' homes, but not controlled prescribed burns and other intentional actions taken by government or forest management agencies. The data provided by the office were in a geo-database format (ESRI data storage and management framework) and contained geographically referenced point locations of forest fires in the Daxing'an Mountains. Prior to 1990, the fire locations were determined by the fire chief, who identified each fire through a combined approach of fixed observation points in the forest and the Terrain and Forest Instruction Map (1:100,000). After 1990, the fire locations were recorded by Global Positioning System (GPS).
We created a binary variable (i.e., fire occurrence) for the logistic regression (LR) and Random Forests (RF) models. For each of the observed fire points (620), the fire occurrence was coded 1 (representing "Yes"). Then, we randomly generated non-fire (i.e., control) points in the study area at a fire-to-control ratio of 1:1.5 [21,24], resulting in 905 control points where the fire occurrence was coded 0 (representing "No"). We excluded control points located in water bodies or urban areas.
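The sampling step above can be sketched as follows. This is a minimal illustration only: the rectangular bounding box (taken from the coordinate range of the study site), the `is_excluded` mask, and the function name are assumptions, whereas the study sampled within the actual study-area boundary in a GIS. With 620 fire points and a ratio of 1.5, the sketch draws 930 candidates; the 905 control points reported here would then correspond to the candidates remaining after exclusion, which is our reading rather than a stated procedure.

```python
import random

def make_control_points(n_fire, ratio=1.5,
                        bounds=(121.2, 127.0, 50.17, 53.55),
                        is_excluded=lambda lon, lat: False, seed=0):
    """Draw round(n_fire * ratio) random points and drop excluded ones.

    bounds is a hypothetical (min_lon, max_lon, min_lat, max_lat) box;
    is_excluded is a hypothetical mask for water bodies and urban areas.
    """
    rng = random.Random(seed)
    min_lon, max_lon, min_lat, max_lat = bounds
    points = []
    for _ in range(round(n_fire * ratio)):
        lon = rng.uniform(min_lon, max_lon)
        lat = rng.uniform(min_lat, max_lat)
        if not is_excluded(lon, lat):
            points.append((lon, lat, 0))  # 0 = no fire (control point)
    return points

# With no exclusion mask, all candidates are kept.
controls = make_control_points(620)
```

Passing a real mask for `is_excluded` would reduce the number of retained control points below the number of candidates.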
The independent or predictor variables consisted of five categories: climate, vegetation, topography, infrastructure, and socio-economic factors. Details of these variables are provided in Table 1. The criteria for selecting the independent variables were based on previous studies of fire occurrence.


Climate Factors
Daily climate data from five national weather stations located in the study area were provided by the China Meteorological Data and Sharing Network [35]. The dataset contains three climate factors: daily mean temperature, daily precipitation, and daily mean relative humidity (Table 1). The corresponding daily climate factors for each fire and control point were retrieved in an ArcGIS 19.0 environment, using the meteorological station closest to each point [10].
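The nearest-station assignment can be illustrated with a short sketch. It assumes great-circle (haversine) distance on longitude/latitude coordinates; the distance measure actually used in the GIS workflow is not stated, and the station names and coordinates below are hypothetical.

```python
import math

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two lon/lat points (degrees)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def nearest_station(point, stations):
    """Return the name of the station closest to point = (lon, lat)."""
    lon, lat = point
    return min(stations, key=lambda name: haversine_km(lon, lat, *stations[name]))

# Hypothetical stations, given as name -> (lon, lat).
stations = {"station_A": (122.0, 52.0), "station_B": (125.0, 51.0)}
closest = nearest_station((122.1, 52.1), stations)
```

Each fire or control point would then take the daily temperature, precipitation, and relative humidity recorded at the returned station on the relevant date.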


Vegetation
A digital vegetation map of China with 1 km resolution was downloaded from the Cold and Arid Regions Science Data Center, China [36]. We grouped polygons into the following five categories (Figure 2), with the relative area of the Daxing'an Mountains reported in parentheses: needle-leaf deciduous trees (30.6% of the study area), broad-leaf deciduous trees (12.8%), needle-leaf evergreen trees (11.5%), broad-leaf deciduous shrub (7.45%), and grass and agricultural crop (37.7%). The vegetation types for each fire and control point (i.e., non-fire) were extracted from the vegetation map layer using ArcGIS 10.2. We used the proportion of each vegetation type located in a fire or control point to develop the model.


Topography
Topographic features affect the spatial patterns of vegetation, plant assemblages, and relative flammability, in addition to influencing local climatic conditions. High-resolution (25 m) digital elevation model (DEM) data were collected from the National Administration of Surveying, Mapping and Geo-information of China. The values of slope (°) and aspect were derived from these DEMs (Figure 3). We transformed the DEM-based aspect into cosine aspect (Cos_as) in an ArcGIS 19.0 environment by using a trigonometric function, so that Cos_as is close to 1 when the aspect is generally northward, close to −1 when the aspect is southward, and close to 0 when the aspect is eastward or westward [35].
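The cosine transformation of aspect can be written in a few lines. The sketch assumes that aspect is measured in degrees clockwise from north; the handling of flat cells (returned as `None` for negative aspect values, following the common GIS convention of coding flat terrain as −1) is also an assumption, since the treatment of flat cells is not described here.

```python
import math

def cosine_aspect(aspect_deg):
    """Map aspect in degrees (0 = north, clockwise) to a value in [-1, 1].

    Returns None for negative input, assumed here to mark flat cells.
    """
    if aspect_deg < 0:
        return None
    return math.cos(math.radians(aspect_deg))

north = cosine_aspect(0)    # northward aspect
east = cosine_aspect(90)    # eastward aspect
south = cosine_aspect(180)  # southward aspect
```

The transformation removes the artificial discontinuity between 359° and 0°, which both describe nearly northward slopes.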

Infrastructure
In the past few decades, the influence of human infrastructure on wildfire has been widely studied [24,32,36-40]. However, certain types of human infrastructure were not considered in previous analyses of wildfire drivers in the Chinese boreal forest, such as the number of fire towers, the number of inspection stations (stations that check for potential fire ignition sources carried by people entering the forest during the fire season), and the length of burned line (the "burned line" is a fire prevention measure in which forest managers intentionally burn ground fuel to decrease the risk of fire occurrence). In this study, we used a number of previously tested variables, but also included several previously untested variables related to fire prevention (Table 1).
Variables such as distance to the nearest railway, distance to the nearest road, and others were retrieved from a 1:250,000 Digital Line Graphic (DLG) map from the National Administration of Surveying, Mapping and Geo-information of China. The distribution of infrastructure is shown in Figure 1.

Socio-Economic Factors
Socio-economic factors included annual funding for forest fire prevention, population density, per capita GDP, and unemployment rate, which were collected from a statistical yearbook [38]. These variables have been used in other similar studies to represent trends in human activity that may influence fire occurrence [36,41,42].

Multicollinearity Test
High correlation between independent variables, namely multicollinearity, may exist in a linear regression model and may distort or interfere with accurate estimation of the model. The variance inflation factor (VIF) method was used to test for multicollinearity in this study, and variables with significant collinearity (VIF ≥ 10) were removed from the models step by step [24].
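As an illustration of the VIF screening, the sketch below computes VIFs as the diagonal of the inverse correlation matrix of the predictors, which is equivalent to 1/(1 − R²) from regressing each predictor on the others. The stepwise rule (drop the predictor with the largest VIF and recompute until all VIFs fall below 10) is our interpretation of the removal procedure, and the variable names and values are made up.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def vif(columns):
    """VIF of each predictor = diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]

def drop_collinear(named_columns, threshold=10.0):
    """Drop the highest-VIF predictor until every VIF is below threshold."""
    kept = dict(named_columns)
    while len(kept) > 1:
        names = list(kept)
        values = vif([kept[n] for n in names])
        worst = max(range(len(names)), key=lambda i: values[i])
        if values[worst] < threshold:
            break
        del kept[names[worst]]
    return kept

# Made-up predictors: x2 is almost exactly 2 * x1, x3 is unrelated.
data = {"x1": [1, 2, 3, 4, 5, 6],
        "x2": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
        "x3": [5, 1, 4, 2, 6, 3]}
initial_vif = vif(list(data.values()))
kept = drop_collinear(data)
```

In this toy example one of the two nearly collinear predictors is removed, after which the remaining VIFs are close to 1.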

Models
(1) Logistic regression (LR). LR describes the relationship between a binary response variable (Y, coded as 0 (representing "No") and 1 (representing "Yes")) and one or more predictor variables (X) by means of a link function. It has been used for fire occurrence prediction and to examine the driving factors of fire occurrence in different regions of the world at various scales [19,40,41].
Logistic regression is commonly expressed as follows:
$$P(Y = 1) = \frac{e^{\eta}}{1 + e^{\eta}} \qquad (1)$$
This model relates the probability of fire occurrence P(Y = 1) to p predictor variables through a multiple linear regression model such that
$$\eta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \qquad (2)$$
where η is the link function in generalized linear models (here, the logit of P(Y = 1)), and β_0-β_p are model coefficients to be estimated from data using the maximum likelihood method. The logistic function (Equation (1)) lies between zero and one and takes on an S-shaped curve.
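A short sketch shows how a fitted logistic model converts predictor values into a probability, assuming the standard logistic form P(Y = 1) = e^η / (1 + e^η) with η a linear combination of the predictors. The coefficient values are hypothetical and are not the estimates of this study.

```python
import math

def fire_probability(x, beta0, betas):
    """Logistic model: P(Y = 1) for predictor values x.

    eta is the linear predictor beta0 + sum(beta_i * x_i); the two
    branches avoid overflow in exp() for large |eta|.
    """
    eta = beta0 + sum(b * v for b, v in zip(betas, x))
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# Hypothetical coefficients for two predictors (illustration only).
p_zero = fire_probability([0.0, 0.0], beta0=0.0, betas=[1.0, -0.5])
p_near = fire_probability([1.0, 2.0], beta0=0.5, betas=[1.0, -0.5])
```

When η = 0 the predicted probability is exactly 0.5, and the probability approaches 0 or 1 as η becomes strongly negative or positive.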
(2) Random Forests (RF). RF is an ensemble learning technique for classification, regression, and other tasks. It operates by constructing a number of decision trees, where each tree is generated from a bootstrap sample, and outputs the mode of the classes (classification) or the mean prediction (regression) of the individual trees [30]. Each decision tree is trained on about two-thirds of the data, while the remaining one-third (namely the out-of-bag (OOB) samples) is retained for model validation [42]. In the modeling process, RF generates variable importance measures by comparing the increase in OOB error when a variable is randomly permuted while all others are kept unchanged [43,44]. RF is a nonparametric modeling algorithm that is robust to outliers and over-fitting, and it also enables variable importance measures to be computed and compared to other regression techniques [30,45,46].
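The permutation idea behind the RF importance measure can be sketched independently of any forest implementation. In the example below, a fixed threshold rule stands in for a trained model and a small made-up dataset stands in for the OOB samples; both are assumptions for illustration, and a real RF repeats this per tree on that tree's OOB samples and averages the result.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which predict(row) matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Drop in accuracy when one feature column is shuffled at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(n_features):
        shuffled = [r[j] for r in rows]
        rng.shuffle(shuffled)
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return drops

# Made-up rows: (distance to railway in km, unrelated noise value).
rows = list(zip([1, 2, 3, 4, 6, 7, 8, 9], [5, 3, 8, 1, 7, 2, 6, 4]))
labels = [1, 1, 1, 1, 0, 0, 0, 0]
rule = lambda r: int(r[0] < 5)  # stand-in "model": fire if railway < 5 km
drops = permutation_importance(rule, rows, labels, n_features=2)
```

Permuting the noise column leaves the accuracy unchanged, so its importance is zero, while permuting the distance column can only reduce the accuracy of this rule.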
The original dataset was randomly divided into training (60%) and validation (40%) samples. This process was repeated five times, resulting in five random sub-samples of the data (each one with its own training and validation dataset). We applied LR to the training data of each sub-sample, creating five intermediate models. In order to validate the resultant intermediate models, each was tested with the validation samples (i.e., 40% of the original dataset). The final LR model was built using variables selected from the five intermediate models and applied to the whole dataset. The predictor variables that were statistically significant (α = 0.05) in at least three of the five intermediate models were included in the final LR model. A backward stepwise selection process was used during LR model fitting.
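The resampling and voting scheme can be sketched as below. The function names, the random seeds, and the per-sub-sample sets of significant variables are hypothetical (they are not the results in Table 2); only the 60/40 split, the five repetitions, and the three-of-five rule come from the description above.

```python
import random

def split_indices(n, train_frac=0.6, seed=0):
    """Randomly split row indices 0..n-1 into training and validation."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(n * train_frac)
    return idx[:cut], idx[cut:]

def select_by_vote(significant_sets, min_votes=3):
    """Keep variables that are significant in at least min_votes models."""
    counts = {}
    for names in significant_sets:
        for name in names:
            counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, c in counts.items() if c >= min_votes)

# Five random 60/40 splits of 620 fire points + 905 control points.
n_points = 620 + 905
splits = [split_indices(n_points, seed=s) for s in range(5)]

# Hypothetical sets of significant variables from five intermediate models.
votes = [{"Forest_type", "Dis_railway"},
         {"Forest_type", "Dis_railway", "Dis_road"},
         {"Forest_type"},
         {"Dis_railway", "Dis_sett"},
         {"Forest_type", "Dis_railway"}]
final_variables = select_by_vote(votes)
```

Requiring agreement across several random splits reduces the chance that a variable enters the final model because of one favorable partition of the data.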
Because RF is generated via bootstrapping samples, it is not necessary to divide the complete dataset into training and validation parts. However, in order to use an approach analogous to that for LR, RF was also fitted to the training dataset of each sub-sample. The number of variables to test at each split (mtry) and the number of trees to grow (ntree) must be defined before running RF. Oliveira et al. [32] found that increasing the value of mtry would result in higher predictive performance. The parameter mtry was identified using the internal RF function tuneRF. This function computes the optimal number of variables starting from the default (mtry = √(total number of variables) for classification) and searches below and above this threshold for the value with the minimum OOB error rate [47]. The ntree parameter was set to 1000 in order to obtain stable results. The most relevant independent variables were selected based on their importance in each sub-sample. Similar to the LR model, the variables that were most relevant in at least three of the five intermediate models were then included in the final model, which was fitted using the complete dataset.
An alternative method of quantifying predictive ability for both LR and RF models is receiver operating characteristic (ROC) analysis [48]. The ROC curve was obtained by plotting sensitivity against 1 − specificity for various probability thresholds. The area under the curve (AUC) is also often used to evaluate performance [49]. An AUC of 0.5 indicates no discrimination, 0.5-0.69 poor discrimination, 0.7-0.79 reasonable discrimination, and 0.8-0.9 excellent discrimination [50]. In other words, a higher AUC indicates better model performance.
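For readers unfamiliar with the AUC, a compact sketch follows. It uses the rank interpretation of the AUC (the probability that a randomly chosen fire point receives a higher predicted probability than a randomly chosen control point, with ties counted as one half), which equals the area under the ROC curve; the scores below are made up.

```python
def auc(scores, labels):
    """AUC as the share of (fire, control) pairs ranked correctly.

    Ties between a fire score and a control score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities and observed outcomes.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
uninformative = auc([0.5, 0.5], [1, 0])
mixed = auc([0.8, 0.4, 0.6, 0.2], [1, 1, 0, 0])
```

A model that ranks every fire point above every control point reaches an AUC of 1, whereas a model whose scores carry no information yields 0.5.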
In addition, we determined a probability threshold (i.e., the cut-off value) based on the Youden index [24], which has previously been used to determine the best cut-off values in logistic regression for predicting wildfire occurrence [21,24,51]. The Youden index is calculated from the sensitivity and specificity of the ROC curve as "sensitivity + specificity − 1", and the best cut-off value is the threshold that maximizes it. If the predicted probability of fire occurrence was at or above the cut-off value, a fire ignition was considered to have occurred. Otherwise, it was registered as no fire [24,52].
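The cut-off search can be sketched as follows, assuming that each distinct predicted probability is tried as a candidate threshold and that a prediction at or above the threshold counts as fire; the scores and labels are made up for illustration.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A score at or above the cutoff is classified as fire (1).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cut)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cut)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:  # keep the lowest cutoff in case of ties
            best_cut, best_j = cut, j
    return best_cut, best_j

# Made-up predicted probabilities and observed outcomes.
cutoff, j_value = youden_cutoff([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In this toy example the threshold 0.35 captures both fire points while rejecting one of the two control points, giving a Youden index of 0.5.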
To map the likelihood of fire occurrence, the fire occurrence probabilities predicted by both the LR and RF models using the whole dataset were interpolated with the Kriging method in an ArcGIS environment.

Test for Multicollinearity
Two independent variables, length of road construction and number of fire towers, were removed because multicollinearity was identified by the VIF test. The remaining 17 independent variables were selected for logistic regression model fitting.

Identification of Driving Factors by LR and RF Models
Five intermediate models were created based on sub-samples of the dataset. Four variables, forest type (Forest_type), distance to the railways (Dis_railway), distance to the roads (Dis_road), and distance to the settlements (Dis_sett), were found to be significant in at least one of the five sub-samples, but only forest type and distance to the railways were significant in more than three sub-samples (Table 2). Thus, these two variables were included in the final model fitted with the complete dataset. Table 2 also shows the parameter estimates of the final LR model. Forest type and distance to the railways were both significant in the final model at the significance level α = 0.01. The final LR model therefore predicts the probability of fire occurrence P(Y = 1) from these two variables, using the parameter estimates in Table 2. The overall prediction accuracy of the final LR model using the whole dataset was 60.8%.
For the RF models, the variable importance plots for the five sub-samples were obtained according to the minimum OOB error principle (Figure 4). We used the variables that were most relevant in at least three of the five intermediate models to fit the complete dataset using the RF model. The most important variables in the final model according to the values of %IncMSE are shown in Table 3: distance to the railways, distance to the settlements, forest type, and distance to the roads (in descending order).

Figure 4. Variable importance measures from Random Forest sub-samples based on Mean Decrease Accuracy (X-axis), which quantifies the importance of a variable by measuring the change in prediction accuracy when the values of the variable are randomly permuted compared with the original observations. The abbreviated variable names are the same as in Table 1.


Comparison of LR and RF Performance
We calculated the prediction accuracy and generated the ROC curve for each sub-sample and for the complete dataset in order to test and compare the predictive ability of LR and RF. The correct prediction rate of LR ranged from 53.6% to 64.5% for the intermediate models and was 60.8% for the final LR model using the whole dataset (Table 4). In contrast, the prediction accuracy of RF was 66.8%-72.6% for the sub-samples and 70.1% for the final RF model using the whole dataset. The ROC curves (Figure 5) indicated that, in terms of AUC values, RF performed better than LR both in the intermediate models using the sub-sample datasets and in the final model using the complete dataset.
Note: p-value (min and max) in the table represents the minimum and maximum significance level of each variable in the intermediate models; significant samples represents the number of the five intermediate models in which the variable tested as significant. The four columns on the right of the table show the parameter estimates of the selected important variables in the final model, which was fitted to the complete dataset. The abbreviated variable names are the same as in Table 1. Note that the code is a short variable name used in the models.
According to the LR final model, high fire risk spots were identified in the northern, eastern, and southern parts of the study area (Figure 6). The map of the likelihood of fire occurrence produced by the RF final model, however, provided a more clearly defined pattern of fire risk for the study area. The high likelihood of fire occurrence was concentrated in the middle and southern portions of the study area (Figure 6).
We also conducted residual analysis to compare the LR and RF final models based on the whole dataset. Figure 7 demonstrated that the RF final model had the better fit (i.e., overall smaller residuals across the study area). In contrast, the overall residual of the LR final model was higher and more spatially uneven than that of RF. The residuals of RF ranged from −0.55 to 0.55, whereas those of LR ranged from −0.65 to 0.55. The positive (under-prediction) and negative (over-prediction) residuals of the LR model were also clustered within the study area. The positive residuals were mainly located in the middle and southern portions of the study area, while the negative residuals were concentrated in the northern and northeastern portions. In comparison, the RF final model had relatively smaller residuals across the entire study area: a small negative residual cluster was located in the north of the study area, and two small positive residual clusters were located in the middle and southern portions (Figure 7).

Discussion
In this study, forest type and distance to the nearest railway were identified as the most important driving factors in both the LR and RF models, while distance to settlement and distance to road were identified as useful predictors in the RF models only.
Forest type reflects fuel conditions, which significantly influence fire ignition in the study area. Other studies have revealed the importance of transportation corridors, such as railways and roads, to forest fire occurrence [20,24,51]. In this study, both railways and roads were also identified as important drivers of human-caused fires by the RF models. Distance to settlement is typically described in terms of the wildland urban interface (WUI), which appeared to be another driver of anthropogenic fire in the Chinese boreal forest.
Socio-economic factors, such as GDP, unemployment, and population density, did not appear to play a crucial role in local human-caused fire occurrence in either the LR or RF models, which is consistent with similar studies [23,24]. One possible explanation is that economic and social development in the Chinese boreal forest has been relatively slow and may not have had a significant impact on anthropogenic forest fires during the past few decades. An additional influence may be China's birth control policy (i.e., one child per family), which has tended to stabilize population growth and, therefore, would not significantly influence anthropogenic fire occurrence more broadly [25]. Similar to socio-economic factors, climate factors were not identified as important by the models, indicating that human, topographic, and fuel factors dominate fire occurrence in the study area.
The RF final model included predictor variables that were not present in the final LR model, such as distance to road and distance to settlement. The difference is possibly attributable to the variable selection method used in the LR model. We used a backward stepwise approach to identify which variables were significant. However, Harrell et al. [29] noted that stepwise variable selection uses the significance level as the criterion for entering a variable, which does not account for the problem of multiple comparisons and may affect the selection of variables with real explanatory power.
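The multiple-comparisons concern can be made concrete with a short sketch. It assumes, as a simplification, that the significance tests are independent, which is generally not true of the tests performed during stepwise selection, so the numbers only indicate the direction of the effect. The function name and the test counts are hypothetical; α = 0.01 is used because it is the significance level reported for the final LR model.

```python
def family_wise_error(alpha, n_tests):
    """Probability of at least one false positive among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Hypothetical numbers of candidate-variable tests (not taken from the study).
for k in (1, 5, 10, 20):
    print(k, round(family_wise_error(0.01, k), 4))
```

Under this assumption, the chance of at least one spurious "significant" variable grows with the number of tests, which is why significance-based entry or removal criteria can distort which variables survive selection.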
In this study, the RF model had a higher predictive capacity than the LR model for human-caused fire occurrence in the Chinese boreal forest. According to the anthropogenic fire likelihood maps based on the RF final model, the southern, southeastern, and middle regions of the study area were identified as having higher fire risk. In these high fire risk regions, the efficacy and efficiency of fire prevention strategies and the use of resources (e.g., fire towers, inspection stations, fire patrols, etc.) might be improved by focusing them on low elevation zones, along railways, and around residential areas during the fire season.
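Predictive capacity was compared using ROC curves and the area under the curve (AUC) (Figure 5). As a minimal sketch of what the AUC measures, the function below computes it as the fraction of (fire, non-fire) pairs in which the fire point receives the higher predicted score, counting ties as one half. This is a standard rank-based formulation rather than the study's own code, and the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for illustration only (not the study's data).
labels = [1, 1, 0, 0, 1, 0]
scores_a = [0.35, 0.60, 0.65, 0.20, 0.55, 0.40]
scores_b = [0.70, 0.80, 0.30, 0.10, 0.75, 0.25]

auc_a = auc(labels, scores_a)
auc_b = auc(labels, scores_b)
print(round(auc_a, 3), round(auc_b, 3))
```

A model whose ROC curve lies above another's has the larger AUC, meaning it ranks fire points above non-fire points more consistently.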

Conclusions
We applied logistic regression and Random Forests to identify which biophysical and human activity factors were important drivers of anthropogenic fire in the Chinese boreal forest. Both methods indicated that forest type and distance to railway significantly influenced anthropogenic fires. In addition, distance to settlement and distance to road were identified by the RF model as useful factors affecting anthropogenic fire occurrence. Our results revealed that anthropogenic fires were more likely to occur close to infrastructure such as railways, roads, and settlements, and were also significantly influenced by forest type.
Socio-economic factors such as GDP, unemployment, and population density were not identified as important drivers of anthropogenic fires in the Chinese boreal forest, which may be due to the relatively slow economic development and population growth in the region.
Compared to logistic regression, Random Forests had a greater ability to predict forest fires caused by human activities in the Chinese boreal forest. According to the spatial distribution of fire occurrence likelihood computed by the RF final model, three "hot spots" were identified in the southern, southeastern, and middle regions of the study area. Our findings provide guidance for local forest fire management on the allocation of fire resources (e.g., fire towers, inspection stations, fire patrols, etc.), which could improve the efficiency of forest fire management in this region of China.

Figure 1. Map of the study area, the Daxing'an Mountain region of northeastern China, showing fire ignition points and human infrastructure.

Figure 2. Distribution of forest vegetation types in the study area, Daxing'an Mountains, China.

Figure 4. Variable importance measures from Random Forest sub-samples based on Mean Decrease Accuracy (X-axis), which quantifies the importance of a variable by measuring the change in prediction accuracy when the values of the variable are randomly permuted compared with the original observations. The abbreviated variable names are the same as in Table 1.

Figure 5. ROC (receiver operating characteristic) curves of each sub-sample and the complete dataset using Random Forests (RF) and logistic regression (LR) models. The upper curve has a larger area under the curve (AUC), meaning higher predictive ability.

Figure 6. Likelihood of fire occurrence created using the Kriging method with ArcGIS based on predicted values using Logistic Regression (a) and Random Forests (b) models.

Figure 7. Spatial distribution of model residuals of logistic regression (a) and Random Forests (b) models.
Daily climate data were extracted from five national weather stations located in the study area. Daily climate data were provided by the China Meteorological Data and Sharing Network

Table 2. Comparison of prediction accuracy and goodness of fit between LR and RF models.
Note: LR, logistic regression; RF, Random Forest. Cut-off values were used to determine the prediction accuracy for each intermediate and final model. The complete dataset was not divided into training and validation subsets.

Forest type and distance to the railways included in the final model were both significant at the significance level α = 0.01. Thus, the final LR model for predicting the probability of fire occurrence P(Y = 1) is

Table 3. Variables included in the final model using Random Forests, in descending order of importance based on mean decrease in accuracy from the complete data set.

Table 4. Variables identified by intermediate models using logistic regression and parameter estimation using the selected variable.

Table 1. Independent or predictor variables included in forest fire model development for Daxing'an Mountains.